Chapter 1

Basic Elements of Statistical Decision 
Theory and Statistical Learning Theory 1 



Throughout this module, let $X$ denote the input to a decision-making process and $Y$ denote the correct
response or output (e.g., the value of a parameter, the label of a class, the signal of interest). We assume
that $X$ and $Y$ are random variables or random vectors with joint distribution $P_{X,Y}(x,y)$, where $x$ and $y$
denote specific values that may be taken by the random variables $X$ and $Y$, respectively. The observation $X$
is used to make decisions pertaining to the quantity of interest. For the purposes of illustration, we will focus
on the task of determining the value of the quantity of interest. A decision rule for this task is a function $f$
that takes the observation $X$ as input and outputs a prediction of the quantity $Y$. We denote a decision rule
by $\hat{Y}$, or by $f(X)$ when we wish to indicate explicitly the dependence of the decision rule on the observation.
We will examine techniques for designing decision rules and for analyzing their performance.

1.1 Measuring Decision Accuracy: Loss and Risk Functions 

The accuracy of a decision is measured with a loss function. For example, if our goal is to determine the
value of $Y$, then a loss function takes as inputs the true value $Y$ and the predicted value (the decision)
$\hat{Y} = f(X)$ and outputs a non-negative real number (the "loss") reflecting the accuracy of the decision. Two
of the most commonly encountered loss functions are:

1. 0/1 loss: $\ell_{0/1}(Y, \hat{Y}) = \mathbf{1}_{\{\hat{Y} \neq Y\}}$, which is the indicator function taking the value 1 when
$\hat{Y} \neq Y$ and the value 0 when $\hat{Y} = Y$.

2. squared error loss: $\ell_2(Y, \hat{Y}) = \|Y - \hat{Y}\|_2^2$, which is simply the sum of squared differences between
the elements of $Y$ and $\hat{Y}$.

The 0/1 loss is commonly used in detection and classification problems, and the squared error loss is more 
appropriate for problems involving the estimation of a continuous parameter. Note that since the inputs to 
the loss function may be random variables, so is the loss. 

A risk $R(f)$ is a function of the decision rule $f$, and is defined to be the expectation of a loss with respect
to the joint distribution $P_{X,Y}(x,y)$. For example, the expected 0/1 loss produces the probability of error
risk function; i.e., a simple calculation shows that $R_{0/1}(f) = E[\mathbf{1}_{\{f(X) \neq Y\}}] = \Pr(f(X) \neq Y)$. The expected
squared error loss produces the mean squared error (MSE) risk function, $R_2(f) = E[\|f(X) - Y\|_2^2]$.

Optimal decisions are obtained by choosing a decision rule $f$ that minimizes the desired risk function.
Given complete knowledge of the probability distributions involved (e.g., $P_{X,Y}(x,y)$) one can explicitly or
numerically design an optimal decision rule, denoted $f^*$, that minimizes the risk function.
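
To make the loss and risk definitions concrete, the following sketch approximates both risk functions by Monte Carlo averaging. The toy model (a fair binary $Y$ observed in Gaussian noise), the threshold rule, and all function names are assumptions chosen for illustration; they are not part of the original notes.

```python
import random

def zero_one_loss(y, y_hat):
    """0/1 loss: 1 if the decision is wrong, 0 otherwise."""
    return 0.0 if y == y_hat else 1.0

def squared_error_loss(y, y_hat):
    """Squared error loss for scalar quantities."""
    return (y - y_hat) ** 2

def monte_carlo_risk(loss, rule, sampler, n_draws=100_000, seed=0):
    """Approximate R(f) = E[loss(Y, f(X))] by averaging over simulated (X, Y) pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x, y = sampler(rng)
        total += loss(y, rule(x))
    return total / n_draws

def toy_sampler(rng):
    """Hypothetical model: Y is a fair coin flip, X = Y + N(0, 0.5^2) noise."""
    y = rng.randint(0, 1)
    return y + rng.gauss(0.0, 0.5), y

def threshold_rule(x):
    """Decide Y = 1 when the observation exceeds 1/2."""
    return 1 if x > 0.5 else 0

# Probability-of-error risk of the threshold rule, and MSE risk of the rule f(x) = x.
risk_01 = monte_carlo_risk(zero_one_loss, threshold_rule, toy_sampler)
risk_sq = monte_carlo_risk(squared_error_loss, lambda x: x, toy_sampler)
```

Under this assumed model the exact values are $\Pr(f(X) \neq Y) = 1 - \Phi(1) \approx 0.159$ for the threshold rule and $E[(X - Y)^2] = 0.25$ for $f(x) = x$, so the two averages should land close to those numbers.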

1.2 The Maximum Likelihood Principle 

The conditional distribution of the observation $X$ given the quantity of interest $Y$ is denoted by $P_{X|Y}(x|y)$.
The conditional distribution $P_{X|Y}(x|y)$ can be viewed as a generative model, probabilistically describing the
observations resulting from a given value, $y$, of the quantity of interest. For example, if $y$ is the value of
a parameter, then $P_{X|Y}(x|y)$ is the probability distribution of the observation $X$ when the parameter value
is set to $y$. If $X$ is a continuous random variable with conditional density $p_{X|Y}(x|y)$ or a discrete random
variable with conditional probability mass function (pmf) $p_{X|Y}(x|y)$, then given a value $y$ we can assess the
probability of a particular measurement value $x$ by the magnitude of either the conditional density or pmf.

In decision making problems, we know the value of the observation, but do not know the value y. 
Therefore, it is appealing to consider the conditional density or pmf as a function of the unknown values y, 
with X fixed at its observed value. The resulting function is called the likelihood function. As the name 
suggests, values of y where the likelihood function is largest are intuitively reasonable indicators of the true 
value of the unknown quantity, which we will denote by y* . The rationale for this is that these values would 
produce conditional densities or pmfs that place high probability on the observation X = x. 

The Maximum Likelihood Estimator (MLE) is defined to be the value of $y$ that maximizes the likelihood
function; i.e., in the continuous case

$$\hat{y}(X) = \arg\max_{y} \, p_{X|Y}(X|y) \qquad (1.1)$$

with an analogous definition for the discrete case by replacing the conditional density with the conditional
pmf. The decision rule $\hat{y}(X)$ is called an "estimator," a term that is common in decision problems involving a
continuous parameter. Note that maximizing the likelihood function is equivalent to minimizing the negative
log-likelihood function (since the logarithm is a monotonic transformation). Now let $y^*$ denote the true value
of $Y$. Then we can view the negative log-likelihood as a loss function

$$\ell_L(y, y^*) = -\log p_{X|Y}(X|y) \qquad (1.2)$$

where the dependence on $y^*$ enters the right-hand side through the observation $X$, which is distributed
according to $p_{X|Y}(x|y^*)$. An interesting special case of the MLE results when the conditional density
$p_{X|Y}(X|y)$ is a Gaussian, in which
case the negative log-likelihood corresponds to a squared error loss function.
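
The Gaussian special case can be checked numerically. The sketch below assumes $n$ independent observations from $N(y, \sigma^2)$ (an extension of the single-observation setting in the text, made here only for illustration) and minimizes the negative log-likelihood over a grid of candidate values. Because the negative log-likelihood is a squared error in $y$ plus a constant, the grid minimizer should coincide with the sample mean up to the grid spacing.

```python
import math
import random

def gaussian_nll(y, xs, sigma=1.0):
    """Negative log-likelihood of mean y for i.i.d. N(y, sigma^2) observations xs."""
    n = len(xs)
    constant = 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
    return constant + sum((x - y) ** 2 for x in xs) / (2.0 * sigma ** 2)

rng = random.Random(1)
y_true = 2.0                                   # assumed true parameter value
xs = [rng.gauss(y_true, 1.0) for _ in range(200)]

# Brute-force MLE: evaluate the loss on a grid of candidates in [0, 4].
grid = [i / 1000.0 for i in range(4001)]
y_mle_grid = min(grid, key=lambda y: gaussian_nll(y, xs))

# Closed-form minimizer of the squared error: the sample mean.
y_mean = sum(xs) / len(xs)
```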

Now let us consider the expectation of this loss, with respect to the conditional distribution $p_{X|Y}(x|y^*)$:

$$-E[\log p_{X|Y}(X|y)] = \int \log\left(\frac{1}{p_{X|Y}(x|y)}\right) p_{X|Y}(x|y^*) \, dx \qquad (1.3)$$

The true value $y^*$ minimizes the expected negative log-likelihood (or, equivalently, maximizes the expected
log-likelihood). To see this, compare the expected log-likelihood of $y^*$ with that of any other value $y$:

$$E[\log p_{X|Y}(X|y^*) - \log p_{X|Y}(X|y)] = E\left[\log \frac{p_{X|Y}(X|y^*)}{p_{X|Y}(X|y)}\right] = KL\left(p_{X|Y}(x|y^*), \, p_{X|Y}(x|y)\right) \qquad (1.4)$$

The quantity $KL(p_{X|Y}(x|y^*), p_{X|Y}(x|y))$ is called the Kullback-Leibler (KL) divergence between the
conditional density functions $p_{X|Y}(x|y^*)$ and $p_{X|Y}(x|y)$. The KL divergence is non-negative, and zero if and
only if the two densities are equal [1]. So, we see that the KL divergence acts as a sort of risk function in
the context of Maximum Likelihood Estimation.
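
As a small numerical check of the KL divergence, the sketch below integrates the definition for two unit-variance Gaussian densities with means $y^*$ and $y$. For this assumed family the divergence has the closed form $(y^* - y)^2 / 2$, which is zero only when $y = y^*$. The integration limits and step count are illustrative choices.

```python
import math

def gauss_pdf(x, mean, sigma=1.0):
    """Density of N(mean, sigma^2) evaluated at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kl_numeric(y_star, y, lo=-20.0, hi=20.0, steps=40_000):
    """Midpoint-rule approximation of KL(p(.|y*), p(.|y)) for unit-variance Gaussians."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        p = gauss_pdf(x, y_star)
        q = gauss_pdf(x, y)
        if p > 0.0 and q > 0.0:          # skip points where either density underflows
            total += p * math.log(p / q) * dx
    return total

kl_different = kl_numeric(0.0, 1.0)      # closed form: (0 - 1)^2 / 2 = 0.5
kl_same = kl_numeric(1.0, 1.0)           # identical densities: divergence is 0
```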

1.3 The Cramer-Rao Lower Bound 

The MLE is based on finding the value for $Y$ that maximizes the likelihood function. Intuitively, the more
distinct the maximum point is, say a well isolated peak in the likelihood function, the easier it will be
to distinguish the MLE from alternative decisions. Consider the case in which $Y$ is a scalar quantity. The
"peakiness" of the log-likelihood function can be gauged by examining its curvature,
$-\frac{\partial^2 \log p_{X|Y}(x|y)}{\partial y^2}$, at the
point of maximum likelihood. The higher the curvature, the more peaky is the behavior of the likelihood
function at the maximum point. Of course, we hope that the MLE will be a good predictor (decision)
for the unknown true value $y^*$. So, rather than looking at the curvature of the log-likelihood function at
the maximum likelihood point, a more appropriate measure of how easy it will be to distinguish $y^*$ from
the alternatives is the expected curvature of the log-likelihood function evaluated at the value $y^*$. The
expectation is taken over all possible observations with respect to the conditional density $p_{X|Y}(x|y^*)$. This
quantity, denoted $I(y^*) = -E\left[\frac{\partial^2 \log p_{X|Y}(X|y)}{\partial y^2}\Big|_{y=y^*}\right]$, is called the Fisher Information (FI). In fact, the FI
provides us with an important performance bound known as the Cramer-Rao Lower Bound (CRLB).

The CRLB states that under some mild regularity assumptions about the conditional density function
$p_{X|Y}(x|y)$, the variance of any unbiased estimator is bounded from below by the inverse of $I(y^*)$ [5], [4],
[3]. Recall that an unbiased estimator is any estimator $\hat{Y}$ that satisfies $E[\hat{Y}] = y^*$. The CRLB tells us
that

$$\mathrm{var}(\hat{Y}) \geq \frac{1}{I(y^*)}. \qquad (1.5)$$

If Y is a vector-valued quantity, then the expected negative Hessian matrix (matrix of partial second 
derivatives) of the log-likelihood function is called the Fisher Information Matrix (FIM), and a similar 
inequality tells us that the variance of each component of any unbiased estimator of y* is bounded below by 
the corresponding diagonal element of the inverse of the FIM. Since the MSE of an unbiased estimator is 
equal to its variance, we see that the CRLB provides a very useful lower bound on the best MSE performance 
that we can hope to achieve. Thus, the CRLB is often used as a comparison point for evaluating estimators. 
It may or may not be possible to achieve the CRLB, but if we find a decision rule that does, we know 
that it also minimizes the MSE risk among all possible unbiased estimators. In general, it may be difficult 
to compute the CRLB, but in certain important cases it is possible to find closed-form or computational 
solutions. 
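
One case where the bound has a closed form is the mean of $n$ independent $N(y^*, \sigma^2)$ observations: the Fisher Information is $n/\sigma^2$, so the CRLB is $\sigma^2/n$, and the sample mean is an unbiased estimator whose variance equals it. The simulation below is a sketch of that comparison; the parameter values and function names are assumptions for illustration.

```python
import random

def fisher_information_gaussian_mean(n, sigma):
    """Fisher Information for the mean of n i.i.d. N(y, sigma^2) observations."""
    return n / sigma ** 2

def empirical_variance_of_sample_mean(y_star, sigma, n, trials=20_000, seed=0):
    """Simulate the sample-mean estimator many times and return its sample variance."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        estimates.append(sum(rng.gauss(y_star, sigma) for _ in range(n)) / n)
    center = sum(estimates) / trials
    return sum((e - center) ** 2 for e in estimates) / (trials - 1)

crlb = 1.0 / fisher_information_gaussian_mean(n=10, sigma=2.0)     # sigma^2 / n = 0.4
var_hat = empirical_variance_of_sample_mean(y_star=1.0, sigma=2.0, n=10)
```

The simulated variance should sit near the bound, which is the sense in which the sample mean achieves the CRLB in this model.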

1.4 Bayesian Decision Theory 

Bayesian Decision Theory provides a formal system for integrating prior knowledge and observations.
For the purposes of illustration we will focus on problems involving continuous variables and
observations, but extensions to discrete cases are straightforward (simply replace probability densities with
probability mass functions, and integrals with summations). The key elements of Bayesian methods are:

1. a prior probability density function $p_Y(y)$ describing a priori knowledge of probable states for the
quantity $Y$;

2. the likelihood function $p_{X|Y}(x|y)$, as described above;

3. the posterior density function $p_{Y|X}(y|x)$.

The posterior density is a function of the prior and likelihood, obtained according to Bayes rule:

$$p_{Y|X}(y|x) = \frac{p_{X|Y}(x|y) \, p_Y(y)}{\int p_{X|Y}(x|y) \, p_Y(y) \, dy}. \qquad (1.6)$$

The posterior is an indicator of probable values for Y, based on the prior knowledge and the observation. 
Several options exist for deriving a specific estimate of Y using the posterior. The mean value of the posterior 
density is one common choice (commonly called the posterior mean). The posterior mean is the decision 
rule that minimizes the expected squared error loss (MSE risk) function. The value y where the posterior 
density is maximized is another popular estimator (commonly called the Maximum A Posteriori (MAP) 
estimator). Note that the denominator of the posterior is independent of y, so the MAP estimator is simply 
the maximizer of the product of the likelihood and the prior. Therefore, if the prior is a constant function, 
the MAP estimator and MLE coincide. 
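
A standard worked case is a Gaussian prior combined with a Gaussian likelihood, where the posterior is again Gaussian and the posterior mean and MAP estimate coincide. The sketch below assumes $Y \sim N(m_0, \tau^2)$ and $X | Y = y \sim N(y, \sigma^2)$; these modeling choices and the function name are illustrative and not taken from the notes.

```python
def gaussian_posterior(x, sigma, prior_mean, prior_std):
    """Posterior mean and variance of Y given X = x when
    Y ~ N(prior_mean, prior_std^2) and X | Y = y ~ N(y, sigma^2)."""
    precision = 1.0 / prior_std ** 2 + 1.0 / sigma ** 2
    variance = 1.0 / precision
    mean = variance * (prior_mean / prior_std ** 2 + x / sigma ** 2)
    return mean, variance

# Equal prior and noise variances: the estimate sits halfway between prior mean and observation.
post_mean, post_var = gaussian_posterior(x=2.0, sigma=1.0, prior_mean=0.0, prior_std=1.0)

# A nearly flat prior: the posterior mean (and MAP estimate) approaches the MLE, which is x.
flat_mean, _ = gaussian_posterior(x=2.0, sigma=1.0, prior_mean=0.0, prior_std=1e6)
```

The second call illustrates the remark above: as the prior flattens, the MAP estimate moves to the maximizer of the likelihood alone.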

1.5 Statistical Learning 

In all of the methods described above, we assumed some amount of knowledge about the distributions of the 
observation X and quantity of interest Y. Such knowledge can come from a careful analysis of the physical 
characteristics of the problem at hand, or it can be gleaned from previous experience. However, there are 
situations where it is difficult to model the physics of the problem and we may not have enough experience 
to develop complete and accurate probability models. In such cases, it is natural to adopt a statistical 
learning approach [2], [7]. 

Statistical learning methods develop decision rules or estimators based only on a collection
of training examples, rather than predetermined probability models. Statistical learning methods are often
said to be distribution-free, since they do not assume particular probability models. The canonical set-up
for statistical learning is as follows. We begin with a collection of training examples, $\{(X_i, Y_i)\}_{i=1}^n$, which are
assumed to be independently and identically distributed according to an unknown probability distribution
$P_{X,Y}(x,y)$. If we knew $P_{X,Y}(x,y)$, then we could compute a desired risk function and design an optimal
decision rule using the methods described above. In essence, the training examples give us a glimpse at the
underlying distribution, but our knowledge of it is far from complete. We cannot exactly compute a risk
function, and therefore we cannot derive a corresponding optimal decision rule.

There are at least two ways to proceed at this point. One possibility is to use the training examples to
estimate the joint probability distribution, and then use this estimate to derive a decision rule. Unfortunately,
the (general-purpose) problem of estimating a distribution is often more difficult from a limited pool
of data than is the problem of designing a specific-purpose decision rule. For this reason, a second possibility
is more commonly favored in practice. Rather than estimating the complete distribution, one can use the
training examples to directly design a decision rule. More precisely, perhaps the most common approach is
to use the training examples to compute an estimate of the desired risk function.

Suppose that we are interested in minimizing a particular risk function. Recall that the risk is the
expected value of a chosen loss function. Let $\ell(\hat{Y}, Y)$ denote the loss, and let $f(X)$ denote a candidate
decision function, mapping observations to predictions about $Y$ (i.e., $\hat{Y} = f(X)$). The empirical risk
function is constructed from the training examples as follows:

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i). \qquad (1.7)$$

This is simply the average loss of the decision rule $f$ over the set of training examples. Note that since the
training examples are independent and identically distributed, the expected value of the empirical risk is
equal to the true risk $R(f) = E[\ell(f(X), Y)]$. Moreover, we know (according to the law of large numbers)
that the empirical risk tends to the true risk as the size of the training sample increases. These facts lend
support to the idea of choosing a decision rule to minimize the empirical risk.

Empirical risk minimization (ERM) is just this process. Given a collection of possible decision rules, say
$\mathcal{F}$, ERM selects a decision rule according to

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}(f). \qquad (1.8)$$

The selected rule, $\hat{f}_n$, obviously depends on the given set of training examples, and therefore it is itself a
random quantity. The theoretically optimal counterpart to $\hat{f}_n$ is the decision rule that minimizes the true
risk

$$f^* = \arg\min_{f \in \mathcal{F}} R(f). \qquad (1.9)$$

The central problem in statistical learning is to quantify how well $\hat{f}_n$ performs relative to $f^*$. Note that
$R(f^*) \leq R(\hat{f}_n)$, since $f^*$ minimizes the true risk. Thus, one way to gauge the performance of $\hat{f}_n$ relative
to $f^*$ is to show that there exist small positive values $\epsilon$ and $\delta$ such that with probability at least $1 - \delta$ we
have

$$R(\hat{f}_n) \leq R(f^*) + \epsilon. \qquad (1.10)$$

If an inequality of this form holds, then we say that $\hat{f}_n$ is a Probably Approximately Correct (PAC)
decision rule [6].

To show that the empirical risk minimizer is a PAC decision rule, we first must understand how closely
the empirical risk matches the true risk. First, let us consider the empirical and true risk of a fixed decision rule
$f$. Assume that the loss function is bounded between 0 and 1 (possibly after a suitable normalization). Then
the empirical risk function is an average of independent random variables bounded between 0 and 1. Hoeffding's
inequality is a bound on the deviations of such random sums from their corresponding mean values [2]. In
this case, the mean value is the true risk of $f$, and Hoeffding's inequality states that

$$P\left(|\hat{R}(f) - R(f)| > \epsilon\right) \leq 2 e^{-2n\epsilon^2}. \qquad (1.11)$$

Another equivalent statement is that the inequality $|\hat{R}(f) - R(f)| \leq \epsilon$ holds with probability at least
$1 - 2e^{-2n\epsilon^2}$. Thus, the two risks are probably close together, and the greater the number of training examples,
$n$, the closer they are.

Now we would like a similar condition to hold for all $f \in \mathcal{F}$, since ERM optimizes over the entire collection
$\mathcal{F}$. Suppose that $\mathcal{F}$ is a finite collection of decision rules. Let $|\mathcal{F}|$ denote the number of rules in $\mathcal{F}$. The
probability that the difference between the true and empirical risks, of one or more of the decision rules,
exceeds $\epsilon$ is bounded by the sum of the probabilities of each individual event of the form $|\hat{R}(f) - R(f)| > \epsilon$,
the so-called Union of Events bound. Therefore, with probability at least $1 - 2|\mathcal{F}|e^{-2n\epsilon^2}$ we have that

$$|\hat{R}(f) - R(f)| \leq \epsilon \qquad (1.12)$$

for all $f \in \mathcal{F}$. Equivalently, setting $\delta = 2|\mathcal{F}|e^{-2n\epsilon^2}$, we have that with probability at least $1 - \delta$ and for all
$f \in \mathcal{F}$

$$|\hat{R}(f) - R(f)| \leq \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2n}}. \qquad (1.13)$$

Notice that the two risks are uniformly close together, and the closeness indicated by the bound increases
as $n$ increases and decreases as the number of decision rules in $\mathcal{F}$ increases. In fact, the bound depends on
$\mathcal{F}$ only through $\log|\mathcal{F}|$, and so it is reasonable to interpret the logarithm of the number of decision rules under consideration
as a measure of the complexity of the class.
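
The bounds in (1.11) and (1.13) are easy to evaluate numerically, which helps in judging how many training examples a given class size calls for. The helper names below are hypothetical; the formulas are the ones stated above, with the loss assumed bounded in $[0, 1]$.

```python
import math

def hoeffding_tail(n, eps):
    """Right-hand side of (1.11): bound on P(|R_hat(f) - R(f)| > eps) for one fixed rule."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

def uniform_deviation(n, num_rules, delta):
    """Right-hand side of (1.13): deviation that holds for all rules with probability >= 1 - delta."""
    return math.sqrt((math.log(num_rules) + math.log(2.0 / delta)) / (2.0 * n))

def samples_needed(eps, num_rules, delta):
    """Smallest n for which the uniform deviation in (1.13) is at most eps."""
    return math.ceil((math.log(num_rules) + math.log(2.0 / delta)) / (2.0 * eps ** 2))

eps_1000 = uniform_deviation(n=1000, num_rules=100, delta=0.05)
n_required = samples_needed(eps=0.1, num_rules=100, delta=0.05)
```

Because the class size enters only through its logarithm, squaring the number of rules merely doubles the $\log|\mathcal{F}|$ term rather than squaring the sample requirement.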

Now using this bound, we can show that $\hat{f}_n$ is a PAC decision rule as follows. Note that with probability
at least $1 - \delta$

$$\begin{aligned}
R(\hat{f}_n) &\leq \hat{R}(\hat{f}_n) + \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2n}} \\
&\leq \hat{R}(f^*) + \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2n}} \\
&\leq R(f^*) + 2\sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2n}}
\end{aligned} \qquad (1.14)$$

where the first inequality follows since the true and empirical risks are close for all $f \in \mathcal{F}$, and in particular for
$\hat{f}_n$, the second inequality holds since by definition $\hat{f}_n$ minimizes the empirical risk, and the third inequality
holds again since the empirical risk is close to the true risk for all $f$, in this case for $f^*$ in particular. So, we
have shown that $\hat{f}_n$ is PAC.

PAC bounds of this form can be extended in many directions, for example to infinitely large or uncountable 
classes of decision rules, but the basic ingredients of the theory are essentially like those demonstrated above. 
The bottom line is that empirical risk minimization is a reasonable approach, provided one has access to 
a sufficient number of training examples and the number, or more generally the complexity, of the class of 
decision rules under consideration is not too great. 
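
The following sketch runs empirical risk minimization over a small finite class under the 0/1 loss. Everything specific in it is an assumption for illustration: the class consists of 21 threshold rules on $[0, 1]$, and the data come from a hypothetical model in which the label is $\mathbf{1}\{x > 0.3\}$ flipped with probability 0.1, so the best rule in the class has true risk 0.1.

```python
import random

def empirical_risk(rule, data):
    """Average 0/1 loss of a decision rule over a list of (x, y) pairs."""
    return sum(1 for x, y in data if rule(x) != y) / len(data)

def make_threshold_rule(t):
    """Rule that predicts 1 when x exceeds the threshold t."""
    return lambda x: 1 if x > t else 0

def erm(rules, data):
    """Empirical risk minimization: pick the rule with the smallest empirical risk."""
    return min(rules, key=lambda rule: empirical_risk(rule, data))

rng = random.Random(0)

def draw(n):
    """Hypothetical data model: X uniform on [0, 1], Y = 1{X > 0.3} with 10% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.3 else 0
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

rules = [make_threshold_rule(i / 20) for i in range(21)]   # finite class, |F| = 21
train = draw(500)
f_hat = erm(rules, train)

# A large independent sample stands in for the true risk of the selected rule.
risk_hat = empirical_risk(f_hat, draw(20_000))
```

With $n = 500$ and $|\mathcal{F}| = 21$, the bound (1.14) with $\delta = 0.05$ allows an excess risk of roughly 0.16, and in practice the selected rule is usually much closer to the best one than the bound guarantees.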

1.6 Further reading 

Excellent treatments of classical decision and estimation theory can be found in a number of textbooks [5] , 
[4], [3], [1]. For references on statistical learning theory, outstanding textbooks are also available [2], [7], [6] 
for further reading. 



Chapter 2 

Elements of Statistical Learning Theory 



2.1 Three Elements of Statistical Data Analysis 

1. Probabilistic Formulation of learning from data and prediction problems.

2. Performance Characterization:

   • concentration inequalities

   • uniform deviation bounds

   • approximation theory

   • rates of convergence

3. Practical Algorithms that run in polynomial time (e.g., decision trees, wavelet methods, support
vector machines).

2.2 Learning from Data 

To formulate the basic learning from data problem, we must specify several basic elements: data spaces, 
probability measures, loss functions, and statistical risk. 

2.2.1 Data Spaces 

Learning from data begins with a specification of two spaces: 

$\mathcal{X}$ = Input Space (2.1)

$\mathcal{Y}$ = Output Space. (2.2)

The input space is also sometimes called the "feature space" or "signal domain." The output space is also 
called the "class label space," "outcome space," "response space," or "signal range." 

Example 2.1 

$\mathcal{X} = \mathbb{R}^d$, the $d$-dimensional Euclidean space of "feature vectors" (2.3)

$\mathcal{Y} = \{0, 1\}$, two classes or "class labels" (2.4)



1 This content is available online at <http://cnx.org/content/m16269/1.2/>.
Example 2.2

$\mathcal{X} = \mathbb{R}$, a one-dimensional signal domain (e.g., time-domain) (2.5)

$\mathcal{Y} = \mathbb{R}$, a real-valued signal (2.6)

A classic example is estimating a signal $f$ in noise:

$$Y = f(X) + W \qquad (2.7)$$

where $X$ is a random sample point on the real line and $W$ is noise independent of $X$.

2.2.2 Probability Measure and Expectation 

Define a joint probability distribution on $\mathcal{X} \times \mathcal{Y}$ denoted $P_{X,Y}$. Let $(X, Y)$ denote a pair of random variables
distributed according to $P_{X,Y}$. We will also have use for marginal and conditional distributions. Let $P_X$
denote the marginal distribution on $\mathcal{X}$, and let $P_{Y|X}$ denote the conditional distribution of $Y$ given $X$.
For any distribution $P$, let $p$ denote its density function with respect to the corresponding dominating
measure; e.g., Lebesgue measure for continuous random variables or counting measure for discrete
random variables.

Define the expectation operator:

$$E_{X,Y}[f(X, Y)] = \int f(x, y) \, dP_{X,Y}(x, y) = \int f(x, y) \, p_{X,Y}(x, y) \, dx \, dy. \qquad (2.8)$$

We will also make use of corresponding marginal and conditional expectations such as $E_X$ and $E_{Y|X}$.

Wherever convenient and obvious based on context, we may drop the subscripts (e.g., $E$ instead of $E_{X,Y}$)
for notational ease.

2.2.3 Loss Functions 

A loss function is a mapping

$$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}. \qquad (2.9)$$

Example 2.3

In binary classification problems, $\mathcal{Y} = \{0, 1\}$. The 0/1 loss function is usually used: $\ell(y_1, y_2) =
\mathbf{1}_{y_1 \neq y_2}$, where $\mathbf{1}_A$ is the indicator function which takes a value of 1 if condition $A$ is true and zero
otherwise. We typically will compare a true label $y$ with a prediction $\hat{y}$, in which case the 0/1 loss
simply counts misclassifications.

Example 2.4

In regression or estimation problems, $\mathcal{Y} = \mathbb{R}$. The squared error loss function is often employed:
$\ell(y_1, y_2) = (y_1 - y_2)^2$, the square of the difference between $y_1$ and $y_2$. In application, we are
interested in a true value $y$ in comparison to an estimate $\hat{y}$.



2.2.4 Statistical Risk 

The basic problem in learning is to determine a mapping $f : \mathcal{X} \to \mathcal{Y}$ that takes an input $x \in \mathcal{X}$ and predicts
the corresponding output $y \in \mathcal{Y}$. The performance of a given map $f$ is measured by its expected loss or risk:

$$R(f) = E_{X,Y}[\ell(f(X), Y)]. \qquad (2.10)$$

The risk tells us how well, on average, the predictor $f$ performs with respect to the chosen loss function. A
key quantity of interest is the minimum risk value, defined as

$$R^* = \inf_{f} R(f) \qquad (2.11)$$

where the infimum is taken over all measurable functions.

2.2.5 The Learning Problem 

Suppose that $(X, Y)$ are distributed according to $P_{X,Y}$ ($(X, Y) \sim P_{X,Y}$ for short). Our goal is to find a map $f$ so
that $f(X) \approx Y$ with high probability. Ideally, we would choose $f$ to minimize the risk $R(f) = E[\ell(f(X), Y)]$.
However, in order to compute the risk (and hence optimize it) we need to know the joint distribution $P_{X,Y}$.
In many problems of practical interest, the joint distribution is unknown, and minimizing the risk is not
possible.

Suppose that we have some exemplary samples from the distribution. Specifically, consider $n$ samples
$\{X_i, Y_i\}_{i=1}^n$ distributed independently and identically (iid) according to the otherwise unknown $P_{X,Y}$. Let us
call these samples training data, and denote the collection by $D_n = \{X_i, Y_i\}_{i=1}^n$. Let us also define a collection
of candidate mappings $\mathcal{F}$. We will use the training data $D_n$ to pick a mapping $\hat{f}_n \in \mathcal{F}$ that we hope will be
a good predictor. This is sometimes called the Model Selection problem. Note that the selected model $\hat{f}_n$
is a function of the training data:

$$\hat{f}_n(X) = f(X; D_n), \qquad (2.12)$$

which is what the subscript $n$ in $\hat{f}_n$ refers to. The risk of $\hat{f}_n$ is given by

$$R(\hat{f}_n) = E_{X,Y}[\ell(\hat{f}_n(X), Y)]. \qquad (2.13)$$

Note that since $\hat{f}_n$ depends on $D_n$ in addition to a new random pair $(X, Y)$, the risk is a random variable
(i.e., a function of the training data $D_n$). Therefore, we are interested in the expected risk, computed over
random realizations of the training data:

$$E_{D_n}[R(\hat{f}_n)]. \qquad (2.14)$$

We hope that $\hat{f}_n$ produces a small expected risk.

The notion of expected risk can be interpreted as follows. We would like to define an algorithm (a model
selection process) that performs well on average, over any random sample of $n$ training data. The expected
risk is a measure of the expected performance of the algorithm with respect to the chosen loss function. That
is, we are not gauging the risk of a particular map $f \in \mathcal{F}$, but rather we are measuring the performance of
the algorithm that takes any realization of training data and selects an appropriate model in $\mathcal{F}$.

This course is concerned with determining "good" model spaces $\mathcal{F}$ and useful and effective model selection
algorithms.
Chapter 3

Introduction to Classification and 
Regression 1 

3.1 Pattern Classification 

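Recall that the goal of classification is to learn a mapping from the feature space, $\mathcal{X}$, to a label space, $\mathcal{Y}$.
This mapping, $f$, is called a classifier. For example, we might have

$$\mathcal{X} = \mathbb{R}^d, \qquad \mathcal{Y} = \{0, 1\}. \qquad (3.1)$$

We can measure the loss of our classifier using 0-1 loss; i.e.,

$$\ell(\hat{y}, y) = \mathbf{1}_{\{\hat{y} \neq y\}} = \begin{cases} 1, & \hat{y} \neq y \\ 0, & \hat{y} = y. \end{cases} \qquad (3.2)$$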

Recalling that risk is defined to be the expected value of the loss function, we have

$$R(f) = E_{XY}[\ell(f(X), Y)] = E_{XY}[\mathbf{1}_{\{f(X) \neq Y\}}] = P_{XY}(f(X) \neq Y). \qquad (3.3)$$

The performance of a given classifier can be evaluated in terms of how close its risk is to the Bayes' risk.

Definition 3.1: (Bayes' Risk)

The Bayes' risk is the infimum of the risk for all classifiers:

$$R^* = \inf_{f} R(f). \qquad (3.4)$$

We can prove that the Bayes risk is achieved by the Bayes classifier.

Definition 3.2: Bayes Classifier

The Bayes classifier is the following mapping:

$$f^*(x) = \begin{cases} 1, & \eta(x) \geq 1/2 \\ 0, & \text{otherwise} \end{cases} \qquad (3.5)$$

where

$$\eta(x) = P_{Y|X}(Y = 1 | X = x). \qquad (3.6)$$


1 This content is available online at <http://cnx.org/content/m16272/1.2/>.
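Note that for any $x$, $f^*(x)$ is the value of $y \in \{0, 1\}$ that maximizes $P_{Y|X}(Y = y | X = x)$.

Theorem 3.1: Risk of the Bayes Classifier

$$R(f^*) = R^*. \qquad (3.7)$$

Proof:

Let $g(x)$ be any classifier. We will show that

$$P(g(X) \neq Y | X = x) \geq P(f^*(x) \neq Y | X = x). \qquad (3.8)$$

For any $g$,

$$\begin{aligned}
P(g(X) \neq Y | X = x) &= 1 - P(Y = g(X) | X = x) \\
&= 1 - \left[P(Y = 1, g(X) = 1 | X = x) + P(Y = 0, g(X) = 0 | X = x)\right] \\
&= 1 - \left[E[\mathbf{1}_{\{Y=1\}} \mathbf{1}_{\{g(X)=1\}} | X = x] + E[\mathbf{1}_{\{Y=0\}} \mathbf{1}_{\{g(X)=0\}} | X = x]\right] \\
&= 1 - \left[\mathbf{1}_{\{g(x)=1\}} E[\mathbf{1}_{\{Y=1\}} | X = x] + \mathbf{1}_{\{g(x)=0\}} E[\mathbf{1}_{\{Y=0\}} | X = x]\right] \\
&= 1 - \left[\mathbf{1}_{\{g(x)=1\}} P(Y = 1 | X = x) + \mathbf{1}_{\{g(x)=0\}} P(Y = 0 | X = x)\right] \\
&= 1 - \left[\mathbf{1}_{\{g(x)=1\}} \eta(x) + \mathbf{1}_{\{g(x)=0\}} (1 - \eta(x))\right].
\end{aligned} \qquad (3.9)$$

Next consider the difference

$$\begin{aligned}
P(g(x) \neq Y | X = x) - P(f^*(x) \neq Y | X = x)
&= \eta(x)\left[\mathbf{1}_{\{f^*(x)=1\}} - \mathbf{1}_{\{g(x)=1\}}\right] + (1 - \eta(x))\left[\mathbf{1}_{\{f^*(x)=0\}} - \mathbf{1}_{\{g(x)=0\}}\right] \\
&= \eta(x)\left[\mathbf{1}_{\{f^*(x)=1\}} - \mathbf{1}_{\{g(x)=1\}}\right] - (1 - \eta(x))\left[\mathbf{1}_{\{f^*(x)=1\}} - \mathbf{1}_{\{g(x)=1\}}\right] \\
&= (2\eta(x) - 1)\left(\mathbf{1}_{\{f^*(x)=1\}} - \mathbf{1}_{\{g(x)=1\}}\right),
\end{aligned} \qquad (3.10)$$

where the second equality follows by noting that $\mathbf{1}_{\{g(x)=0\}} = 1 - \mathbf{1}_{\{g(x)=1\}}$. Next recall

$$f^*(x) = \begin{cases} 1, & \eta(x) \geq 1/2 \\ 0, & \text{otherwise.} \end{cases} \qquad (3.11)$$

For $x$ such that $\eta(x) \geq 1/2$, we have

$$\underbrace{(2\eta(x) - 1)}_{\geq 0} \Big(\underbrace{\underbrace{\mathbf{1}_{\{f^*(x)=1\}}}_{=1} - \underbrace{\mathbf{1}_{\{g(x)=1\}}}_{0 \text{ or } 1}}_{\geq 0}\Big) \geq 0 \qquad (3.12)$$

and for $x$ such that $\eta(x) < 1/2$, we have

$$\underbrace{(2\eta(x) - 1)}_{< 0} \Big(\underbrace{\underbrace{\mathbf{1}_{\{f^*(x)=1\}}}_{=0} - \underbrace{\mathbf{1}_{\{g(x)=1\}}}_{0 \text{ or } 1}}_{\leq 0}\Big) \geq 0, \qquad (3.13)$$

which implies

$$(2\eta(x) - 1)\left(\mathbf{1}_{\{f^*(x)=1\}} - \mathbf{1}_{\{g(x)=1\}}\right) \geq 0 \qquad (3.14)$$

or

$$P(g(X) \neq Y | X = x) \geq P(f^*(x) \neq Y | X = x). \qquad (3.15)$$

Taking the expectation of both sides over $X$ gives $R(g) \geq R(f^*)$ for every classifier $g$, so $R(f^*) = R^*$.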



Note that while the Bayes classifier achieves the Bayes risk, in practice this classifier is not realizable
because we do not know the distribution $P_{XY}$ and so cannot construct $\eta(x)$.
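
When $\eta(x)$ is known, as in a simulation, the Bayes classifier and the pointwise inequality (3.8) can be checked directly. The sketch below assumes a hypothetical logistic form for $\eta$ and a uniform distribution for $X$; neither comes from the notes. The risk is computed by averaging the exact conditional error from (3.9) over sampled inputs.

```python
import math
import random

def eta(x):
    """Hypothetical posterior P(Y = 1 | X = x), assumed logistic for illustration."""
    return 1.0 / (1.0 + math.exp(-4.0 * x))

def bayes_classifier(x):
    """Bayes classifier (3.5): predict 1 when eta(x) >= 1/2."""
    return 1 if eta(x) >= 0.5 else 0

def conditional_error(label, x):
    """P(label != Y | X = x), which is 1 - eta(x) for label 1 and eta(x) for label 0."""
    return 1.0 - eta(x) if label == 1 else eta(x)

def risk(classifier, xs):
    """Average conditional error over sampled inputs (Monte Carlo over X only)."""
    return sum(conditional_error(classifier(x), x) for x in xs) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(10_000)]

def shifted(x):
    """A competing classifier whose decision boundary is moved away from eta(x) = 1/2."""
    return 1 if x >= 0.7 else 0

bayes_risk = risk(bayes_classifier, xs)
shifted_risk = risk(shifted, xs)
```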

3.2 Regression 

The goal of regression is to learn a mapping from the input space, $\mathcal{X}$, to the output space, $\mathcal{Y}$. This mapping,
$f$, is called an estimator. For example, we might have

$$\mathcal{X} = \mathbb{R}^d, \qquad \mathcal{Y} = \mathbb{R}. \qquad (3.16)$$

We can measure the loss of our estimator using squared error loss; i.e.,

$$\ell(\hat{y}, y) = (y - \hat{y})^2. \qquad (3.17)$$

Recalling that risk is defined to be the expected value of the loss function, we have

$$R(f) = E_{XY}[\ell(f(X), Y)] = E_{XY}[(f(X) - Y)^2]. \qquad (3.18)$$

The performance of a given estimator can be evaluated in terms of how close the risk is to the infimum of
the risk for all estimators under consideration:

$$R^* = \inf_{f} R(f). \qquad (3.19)$$

Theorem 3.2: Minimum Risk under Squared Error Loss (MSE)
Let $f^*(x) = E_{Y|X}[Y | X = x]$. Then

$$R(f^*) = R^*. \qquad (3.20)$$
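Proof:

$$\begin{aligned}
R(f) &= E_{XY}\left[(f(X) - Y)^2\right] \\
&= E_X\left[E_{Y|X}\left[(f(X) - Y)^2 | X\right]\right] \\
&= E_X\left[E_{Y|X}\left[(f(X) - E_{Y|X}[Y|X] + E_{Y|X}[Y|X] - Y)^2 | X\right]\right] \\
&= E_X\Big[E_{Y|X}\left[(f(X) - E_{Y|X}[Y|X])^2 | X\right] \\
&\qquad + 2 E_{Y|X}\left[(f(X) - E_{Y|X}[Y|X])(E_{Y|X}[Y|X] - Y) | X\right] \\
&\qquad + E_{Y|X}\left[(E_{Y|X}[Y|X] - Y)^2 | X\right]\Big] \\
&= E_X\Big[E_{Y|X}\left[(f(X) - E_{Y|X}[Y|X])^2 | X\right] \\
&\qquad + 2 (f(X) - E_{Y|X}[Y|X]) \times 0 \\
&\qquad + E_{Y|X}\left[(E_{Y|X}[Y|X] - Y)^2 | X\right]\Big] \\
&= E_{XY}\left[(f(X) - E_{Y|X}[Y|X])^2\right] + R(f^*).
\end{aligned} \qquad (3.21)$$

The first term is non-negative and vanishes when $f = f^*$, so $R(f) \geq R(f^*)$ for every $f$.
Thus if $f^*(x) = E_{Y|X}[Y | X = x]$, then $R(f^*) = R^*$, as desired.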
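
The decomposition in (3.21) says that any estimator pays the risk of the regression function plus its mean squared distance from it. The simulation below is a sketch under assumed conditions that are not in the notes: $X$ uniform on $[0, 1]$, $E[Y | X = x] = x^2$, and Gaussian noise with standard deviation 0.5. An estimator offset by 0.3 should then have a risk about $0.3^2 = 0.09$ higher.

```python
import random

def regression_function(x):
    """Assumed conditional mean E[Y | X = x] for this toy model."""
    return x * x

def mse_risk(f, n_draws=200_000, sigma=0.5, seed=0):
    """Monte Carlo approximation of R(f) = E[(f(X) - Y)^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.random()
        y = regression_function(x) + rng.gauss(0.0, sigma)
        total += (f(x) - y) ** 2
    return total / n_draws

def offset_estimator(x):
    """A competing estimator that is uniformly 0.3 away from the regression function."""
    return regression_function(x) + 0.3

risk_star = mse_risk(regression_function)    # close to sigma^2 = 0.25
risk_other = mse_risk(offset_estimator)      # close to 0.25 + 0.09
```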



3.3 Empirical Risk Minimization 

Definition 3.3: Empirical Risk

Let $\{X_i, Y_i\}_{i=1}^n \overset{iid}{\sim} P_{XY}$ be a collection of training data. Then the empirical risk is defined as

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i). \qquad (3.22)$$

Empirical risk minimization is the process of choosing a learning rule which minimizes the empirical
risk; i.e.,

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f). \qquad (3.23)$$

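Example 3.1: Pattern Classification

Let the set of possible classifiers be

$$\mathcal{F} = \left\{x \mapsto \mathrm{sign}(w'x) : w \in \mathbb{R}^d\right\} \qquad (3.24)$$

and let the feature space, $\mathcal{X}$, be $[0, 1]^d$ or $\mathbb{R}^d$. If we use the notation $f_w(x) = \mathrm{sign}(w'x)$, then the
set of classifiers can be alternatively represented as

$$\mathcal{F} = \left\{f_w : w \in \mathbb{R}^d\right\}. \qquad (3.25)$$

In this case, the classifier which minimizes the empirical risk is

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{\mathrm{sign}(w'X_i) \neq Y_i\}}. \qquad (3.26)$$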




Figure 3.1: Example linear classifier for two-class problem. 



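Example 3.2: Regression

Let the feature space be

$$\mathcal{X} = [0, 1] \qquad (3.27)$$

and let the set of possible estimators be

$$\mathcal{F} = \{\text{degree } d \text{ polynomials on } [0, 1]\}. \qquad (3.28)$$

In this case, the estimator which minimizes the empirical risk is

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - Y_i)^2. \qquad (3.29)$$

Alternatively, this can be expressed as

$$\hat{w} = \arg\min_{w \in \mathbb{R}^{d+1}} \frac{1}{n} \sum_{i=1}^{n} \left(w_0 + w_1 X_i + \ldots + w_d X_i^d - Y_i\right)^2 = \arg\min_{w \in \mathbb{R}^{d+1}} \|Vw - Y\|^2 \qquad (3.30)$$

where $V$ is the Vandermonde matrix

$$V = \begin{bmatrix} 1 & X_1 & \ldots & X_1^d \\ 1 & X_2 & \ldots & X_2^d \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_n & \ldots & X_n^d \end{bmatrix}. \qquad (3.31)$$

The pseudoinverse can be used to solve for $\hat{w}$:

$$\hat{w} = (V'V)^{-1} V'Y. \qquad (3.32)$$

A polynomial estimate is displayed in Figure 3.2.
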

Figure 3.2: Example polynomial estimator. Blue curve denotes $f^*$, magenta curve is the polynomial
fit to the data (denoted by dots).
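
The least-squares polynomial fit of Example 3.2 can be sketched in a few lines by forming the Vandermonde matrix and solving the normal equations $V'V w = V'Y$. The helper names are hypothetical, and the small Gaussian-elimination solver is included only to keep the example self-contained; for higher degrees a numerically more careful least-squares routine would be preferable.

```python
def vandermonde(xs, degree):
    """Rows [1, x, x^2, ..., x^degree], one row per sample point."""
    return [[x ** j for j in range(degree + 1)] for x in xs]

def solve_linear_system(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w

def polyfit_least_squares(xs, ys, degree):
    """Coefficients w minimizing ||V w - Y||^2, obtained from the normal equations."""
    v = vandermonde(xs, degree)
    k = degree + 1
    vtv = [[sum(row[i] * row[j] for row in v) for j in range(k)] for i in range(k)]
    vty = [sum(row[i] * y for row, y in zip(v, ys)) for i in range(k)]
    return solve_linear_system(vtv, vty)

# Noise-free data from a known quadratic: the fit should recover its coefficients.
sample_x = [0.0, 0.25, 0.5, 0.75, 1.0]
sample_y = [1.0 + 2.0 * x - 3.0 * x * x for x in sample_x]
w_hat = polyfit_least_squares(sample_x, sample_y, degree=2)
```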



3.4 Overfitting 

Suppose $\mathcal{F}$, our collection of candidate functions, is very large. We can always make

$$\min_{f \in \mathcal{F}} \hat{R}_n(f) \qquad (3.33)$$

smaller by increasing the cardinality of $\mathcal{F}$, thereby providing more possibilities to fit to the data.

Consider this extreme example: Let $\mathcal{F}$ be all measurable functions. Then every function $f$ for which

$$f(x) = \begin{cases} Y_i, & x = X_i \text{ for } i = 1, \ldots, n \\ \text{any value}, & \text{otherwise} \end{cases} \qquad (3.34)$$

has zero empirical risk ($\hat{R}_n(f) = 0$). However, clearly this could be a very poor predictor of $Y$ for a new
input $X$.
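
The extreme example can be reproduced directly. The sketch below builds a "memorizing" function of the form (3.34) that returns the stored $Y_i$ at each training input and an arbitrary value (0 here) elsewhere, then compares its empirical risk with its average loss on fresh data. The data model $Y = X + \text{noise}$ is an assumption for illustration.

```python
import random

rng = random.Random(0)

def draw(n):
    """Hypothetical data model: X uniform on [0, 1], Y = X + N(0, 0.1^2) noise."""
    points = []
    for _ in range(n):
        x = rng.random()
        points.append((x, x + rng.gauss(0.0, 0.1)))
    return points

def empirical_risk(f, data):
    """Average squared error of f over a list of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)

train = draw(20)
lookup = dict(train)

def memorizer(x):
    """Returns the stored Y_i at a training input, and the arbitrary value 0 elsewhere."""
    return lookup.get(x, 0.0)

fresh = draw(5_000)
train_risk = empirical_risk(memorizer, train)          # exactly zero
fresh_risk = empirical_risk(memorizer, fresh)          # large: about E[Y^2]
simple_risk = empirical_risk(lambda x: x, fresh)       # about the noise variance, 0.01
```

The memorizer fits the training data perfectly yet predicts far worse on new inputs than the simple rule $f(x) = x$, which is the overfitting phenomenon described above.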
Example 3.3: Classification Overfitting

Consider the classifier in Figure 3.3; this demonstrates overfitting in classification. If the data 
were in fact generated from two Gaussian distributions centered in the upper left and lower right 
quadrants of the feature space domain, then the optimal estimator would be the linear estimator 
in Figure 3.1; the overfitting would result in a higher probability of error for predicting classes of 
future observations. 




Figure 3.3: Example of overfitting classifier. The classifier's decision boundary wiggles around in order 
to correctly label the training data, but the optimal Bayes classifier is a straight line. 



Example 3.4: Regression Overfitting 

Below is an m-file that simulates the polynomial fitting. Feel free to play around with it to get an 
idea of the overfitting problem. 

% poly fitting
% rob nowak  1/24/04
clear
close all

% generate and plot "true" function
t = (0:.001:1)';
f = exp(-5*(t-.3).^2) + .5*exp(-100*(t-.5).^2) + .5*exp(-100*(t-.75).^2);
figure(1)
plot(t,f)

% generate n training data & plot
n = 10;
sig = 0.1; % std of noise
x = .97*rand(n,1) + .01;
y = exp(-5*(x-.3).^2) + .5*exp(-100*(x-.5).^2) + .5*exp(-100*(x-.75).^2) + sig*randn(size(x));
figure(1)
clf
plot(t,f)
hold on
plot(x,y,'.')

% fit with polynomial of order k (poly degree up to k-1)
k = 3;
V = zeros(n,k);            % Vandermonde matrix at the training points
for i = 1:k
    V(:,i) = x.^(i-1);
end
p = (V'*V)\(V'*y);         % normal equations, solved without forming inv(V'*V)
Vt = zeros(length(t),k);   % Vandermonde matrix on the plotting grid
for i = 1:k
    Vt(:,i) = t.^(i-1);
end
yh = Vt*p;
figure(1)
clf
plot(t,f)
hold on
plot(x,y,'.')
plot(t,yh,'m')
[Figure 3.4, panels (a)-(d): polynomial fits plotted over the interval from 0 to 1.]

Figure 3.4: Example polynomial fitting problem. Blue curve is $f^*$, magenta curve is the polynomial fit to the data (dots). (a) Fitting a polynomial of degree $d = 0$: This is an example of underfitting. (b) $d = 2$. (c) $d = 4$. (d) $d = 6$: This is an example of overfitting. The empirical loss is zero, but clearly the estimator






Chapter 4 

Introduction to Complexity 
Regularization 1 

4.1 Competing Goals: The Bias-Variance Tradeoff

We ended the previous lecture (Chapter 3) with a brief discussion of overfitting. Recall that, given a set of $n$ data points, $D_n$, and a space of functions (or models) $\mathcal{F}$, our goal in solving the learning from data problem is to choose a function $\hat{f}_n \in \mathcal{F}$ which minimizes the expected risk $E[R(\hat{f}_n)]$, where the expectation is being taken over the distribution $P_{XY}$ on the data points $D_n$. One approach to avoiding overfitting is to restrict $\mathcal{F}$ to some subset of all measurable functions. To gauge the performance of a given $\hat{f}_n$ in this case, we examine the difference between the expected risk of $\hat{f}_n$ and the Bayes risk (called the excess risk).

$$E[R(\hat{f}_n)] - R^* = \underbrace{\left( E[R(\hat{f}_n)] - \inf_{f \in \mathcal{F}} R(f) \right)}_{\text{estimation error}} + \underbrace{\left( \inf_{f \in \mathcal{F}} R(f) - R^* \right)}_{\text{approximation error}} \tag{4.1}$$

The approximation error term quantifies the performance hit incurred by imposing restrictions on $\mathcal{F}$. The estimation error term is due to the randomness of the training data, and it expresses how well the chosen function $\hat{f}_n$ will perform in relation to the best possible $f$ in the class $\mathcal{F}$. This decomposition into stochastic and approximation errors is similar to the bias-variance tradeoff which arises in classical estimation theory. The approximation error is like a bias squared term, and the estimation error is like a variance term. By allowing the space $\mathcal{F}$ to be large (see footnote 2) we can make the approximation error as small as we want at the cost of incurring a large estimation error. On the other hand, if $\mathcal{F}$ is very small then the approximation error will be large, but the estimation error may be very small. This tradeoff is illustrated in Figure 4.1.



1 This content is available online at <http://cnx.org/content/m16274/1.2/>.
2 When we say $\mathcal{F}$ is large, we mean that $|\mathcal{F}|$, the number of elements in $\mathcal{F}$, is large.
[Plot: estimation error and approximation error curves versus the complexity of $\mathcal{F}$.]

Figure 4.1: Illustration of tradeoff between estimation and approximation errors as a function of the size (complexity) of $\mathcal{F}$.



Why is this the case? We do not know the true distribution $P_{XY}$ on the data, so instead of minimizing the expected risk directly we design a predictor by minimizing the empirical risk:

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f), \qquad \hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i). \tag{4.2}$$

If $\mathcal{F}$ is very large then $\hat{R}_n(f)$ can be made arbitrarily small and the resulting $\hat{f}_n$ can "overfit" to the data, since $\hat{R}_n(f)$ is not a good estimator of the true risk $R(\hat{f}_n)$.
[Plot: prediction error versus complexity, showing the empirical risk and true risk curves, with the underfitting region, the best model, and the overfitting region marked.]

Figure 4.2: Illustration of empirical risk and the problem of overfitting to the data.



The behavior of the true and empirical risks, as a function of the size (or complexity) of the space $\mathcal{F}$, is illustrated in Figure 4.2. Unfortunately, we can't easily determine whether we are over- or underfitting just by looking at the empirical risk.

4.2 Strategies To Avoid Overfitting 

Picking

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) \tag{4.3}$$

is problematic if $\mathcal{F}$ is large. We will examine two general approaches to dealing with this problem:

1. Restrict the size or dimension of $\mathcal{F}$ (e.g., restrict $\mathcal{F}$ to the set of all lines, or polynomials with maximum degree $d$). This effectively places an upper bound on the estimation error, but in general it also places a lower bound on the approximation error.

2. Modify the empirical risk criterion to include an extra cost associated with each model (e.g., higher cost for more complex models):

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + C(f) \right\}. \tag{4.4}$$

The cost is designed to mimic the behavior of the estimation error so that the model selection procedure avoids models with a large estimation error. Roughly this can be interpreted as trying to balance the tradeoff illustrated in Figure 4.1. Procedures of this type are often called complexity penalization methods.
Example 4.1

Revisit the polynomial regression example (Lecture 2, Ex. 4) (Example 3.4: Regression Overfitting), and incorporate a penalty term $C(f)$ which is proportional to the degree of $f$, or the derivative of $f$. In essence, this approach penalizes functions which are too "wiggly", with the intuition being that the true function is probably smooth, so a function which is very wiggly will overfit the data.
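One way Example 4.1 might be carried out is sketched below. This is only an illustration: the penalty $C(f) = \lambda (d+1)$ for a degree-$d$ polynomial, the value of `lambda`, the sample size, and the candidate degrees are assumptions and do not come from the notes.

```matlab
% Sketch: choose a polynomial degree by minimizing empirical risk + penalty.
% Assumptions (not from the text): the penalty C(f) = lambda*(degree+1), the
% value of lambda, n = 30, and the candidate degrees 0..8 are illustrative.
rng(1);
n   = 30;
sig = 0.1;
x = .97*rand(n,1) + .01;
y = exp(-5*(x-.3).^2) + .5*exp(-100*(x-.5).^2) ...
    + .5*exp(-100*(x-.75).^2) + sig*randn(n,1);

lambda  = 0.005;                  % hypothetical cost per coefficient
degrees = 0:8;
emp  = zeros(size(degrees));      % empirical risk for each degree
cost = zeros(size(degrees));      % penalty C(f) for each degree
for j = 1:numel(degrees)
    d = degrees(j);
    V = bsxfun(@power, x, 0:d);   % n-by-(d+1) Vandermonde matrix
    p = V \ y;                    % least-squares coefficients
    emp(j)  = mean((V*p - y).^2);
    cost(j) = lambda*(d + 1);
end
[~, jpen] = min(emp + cost);      % penalized choice
[~, jerm] = min(emp);             % unpenalized (pure empirical risk) choice
fprintf('penalized degree = %d, unpenalized degree = %d\n', ...
        degrees(jpen), degrees(jerm));
```

Because the penalty increases with the degree, the penalized criterion can never select a higher degree than plain empirical risk minimization does; how much lower it goes depends on `lambda`.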

How do we decide how to restrict or penalize the empirical risk minimization process? Approaches which have appeared in the literature include the following.



4.2.1 Method of Sieves 

Perhaps the simplest approach is to try to limit the size of $\mathcal{F}$ in a way that depends on the number of training data $n$. The more data we have, the more complex the space of models we can entertain. Let the class of candidate functions grow with $n$. That is, take

$$\mathcal{F}_1, \mathcal{F}_2, \dots, \mathcal{F}_n, \dots \tag{4.5}$$

where $|\mathcal{F}_i|$ grows as $i \to \infty$. In other words, consider a sequence of spaces with increasing complexity or degrees of freedom depending on the number of training data samples, $n$.

Given samples $\{X_i, Y_i\}_{i=1}^{n}$ i.i.d. distributed according to $P_{XY}$, select $f \in \mathcal{F}_n$ to minimize the empirical risk

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}_n} \hat{R}_n(f). \tag{4.6}$$

In the next lecture (Chapter 5) we will consider an example using the method of sieves. The basic idea is to design the sequence of model spaces in such a way that the excess risk decays to zero as $n \to \infty$. This sort of idea has been around for decades, but Grenander's method of sieves is often cited as a nice formalization of the idea: Abstract Inference, Wiley, New York.

4.2.2 Complexity Penalization Methods 

4.2.2.1 Bayesian Methods 

In certain cases, the empirical risk happens to be a (log) likelihood function, and one can then interpret the cost $C(f)$ as reflecting prior knowledge about which models are more or less likely. In this case, $e^{-C(f)}$ is like a prior probability distribution on the space $\mathcal{F}$. The cost $C(f)$ is large if $f$ is highly improbable, and $C(f)$ is small if $f$ is highly probable.

Alternatively, if we restrict $\mathcal{F}$ to be small, and denote the space of all measurable functions as $\overline{\mathcal{F}} = \mathcal{F} \cup \mathcal{F}^c$, then it is essentially as if we have placed a uniform prior over all functions in $\mathcal{F}$, and zero prior probability on the functions in $\mathcal{F}^c$.

4.2.2.2 Description Length Methods

Description length methods represent each $f$ with a string of bits. More complicated functions require more bits to represent. Accordingly, we can then set the cost $C(f)$ proportional to the number of bits needed to describe $f$ (the description length). This results in what is known as the minimum description length (MDL) approach, where the minimum description length is given by

$$\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + C(f) \right\}. \tag{4.7}$$

In the Bayesian setting, $p(f) \propto e^{-C(f)}$ can be interpreted as a prior probability density on $\mathcal{F}$, with more complex models being less probable and simpler models being more probable. In that sense, both the Bayesian and MDL approaches have a similar spirit.

4.2.2.3 Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis (VC) dimension measures the complexity of a class $\mathcal{F}$ relative to a random sample of $n$ training data. For example, take $\mathcal{F}$ to be all linear classifiers in 2-dimensional feature space. Clearly, the space of linear classifiers is infinite (there are an infinite number of lines which can be drawn in the plane). However, many of these linear classifiers would assign the same labels to the training data.

The number of unique labellings of the training data that can be achieved with linear classifiers is, in fact, finite. A line can be defined by picking any pair of training points, as illustrated in Figure 4.3. Two classifiers can be defined from each such line: one that outputs a label "1" for everything on or above the line, and another that outputs "0" for everything on or above. There exist $\binom{n}{2}$ such pairs of training points, and these define all possible unique labellings of the training data. Therefore, there are at most $2\binom{n}{2}$ unique linear classifiers for any random set of $n$ 2-dimensional features (the factor of 2 is due to the fact that for each linear classifier there are 2 possible assignments of the labelling).




Figure 4.3: Fitting a linear classifier to 2-dimensional data. There are an infinite number of such classifiers. We can generate a linear classifier by choosing two data points, drawing a line with both points on one side, and declaring all points on or above the line to be "+1" (or "-1") and all points below the line to be "-1" (or "+1").

Figure 4.4: From the discussion in the previous figure, we see that the two linear classifiers depicted in this figure are equivalent for this set of data points, and hence relative to the set of $n$ training data there are only on the order of $n^2$ unique linear classifiers.



Thus, instead of infinitely many linear classifiers, we realize that as far as a random sample of $n$ training data is concerned, there are at most

$$2\binom{n}{2} = \frac{2\, n!}{(n-2)!\, 2!} = n(n-1) \tag{4.8}$$

unique linear classifiers. That is, using linear classification rules, there are at most $n(n-1) \approx n^2$ unique label assignments for $n$ data points. If we like, we can encode each possibility with $\log_2 n(n-1) \approx 2\log_2 n$ bits. In $d$ dimensions there are $2\binom{n}{d}$ hyperplane classification rules which can be encoded in roughly $d \log_2 n$ bits. Roughly speaking, the number of bits required for encoding each model is the VC dimension. The remarkable aspect of the VC dimension is that it is often finite even when $\mathcal{F}$ is infinite (as in this example).

If $\mathcal{X}$ has $d$ dimensions in total, we might consider linear classifiers based on $1, 2, \dots, d$ features at a time. Lower dimensional hyperplanes are less complex than higher dimensional ones. Suppose we set

$$\begin{aligned} \mathcal{F}_1 &= \text{linear classifiers using 1 feature} \\ \mathcal{F}_2 &= \text{linear classifiers using 2 features} \\ &\ \ \text{and so on.} \end{aligned} \tag{4.9}$$

These spaces have increasing VC dimensions, and we can try to balance the empirical risk and a cost function depending on the VC dimension. Such procedures are often referred to as Structural Risk Minimization. This gives you a glimpse of what the VC dimension is all about. In future lectures we will revisit this topic in greater detail.
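As a small worked instance of the counting argument for linear classifiers in the plane (the sample size $n = 100$ is an arbitrary choice made here for illustration):

$$2\binom{100}{2} = 100 \cdot 99 = 9900, \qquad \log_2 9900 \approx 13.27 \text{ bits}, \qquad 2\log_2 100 \approx 13.29 \text{ bits}.$$

So although there are infinitely many lines in the plane, roughly 13 bits are enough to index every labelling that a line can induce on 100 training points, and the approximation $2\log_2 n$ is already accurate at this sample size.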
4.2.3 Hold-out Methods

The basic idea of "hold-out" methods is to split the $n$ samples $D = \{X_i, Y_i\}_{i=1}^{n}$ into a training set, $D_T$, and a test set, $D_V$:

$$D_T = \{X_i, Y_i\}_{i=1}^{m}, \qquad D_V = \{X_i, Y_i\}_{i=m+1}^{n}. \tag{4.10}$$

Now, suppose we have a collection of different model spaces $\{\mathcal{F}_\lambda\}$ indexed by $\lambda \in \Lambda$ (e.g., $\mathcal{F}_\lambda$ is the set of polynomials of degree $d$, with $\lambda = d$), or suppose that we have a collection of complexity penalization criteria $L_\lambda(f)$ indexed by $\lambda$ (e.g., let $L_\lambda(f) = \hat{R}(f) + \lambda c(f)$, with $\lambda \in \mathbb{R}^+$). We can obtain candidate solutions using the training set as follows. Define

$$\hat{R}_m(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(X_i), Y_i) \tag{4.11}$$

and take

$$\hat{f}_\lambda = \arg\min_{f \in \mathcal{F}_\lambda} \hat{R}_m(f) \tag{4.12}$$

or

$$\hat{f}_\lambda = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_m(f) + \lambda c(f) \right\}. \tag{4.13}$$

This provides us with a set of candidate solutions $\{\hat{f}_\lambda\}$. Then we can define the hold-out error estimate using the test set:

$$\hat{R}_V(f) = \frac{1}{n-m} \sum_{i=m+1}^{n} \ell(f(X_i), Y_i), \tag{4.14}$$

and select the "best" model to be $\hat{f} = \hat{f}_{\hat{\lambda}}$ where

$$\hat{\lambda} = \arg\min_{\lambda} \hat{R}_V(\hat{f}_\lambda). \tag{4.15}$$

This type of procedure has many nice theoretical guarantees, provided both the training and test set grow with $n$.
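The hold-out procedure can be sketched as follows for the case where $\lambda$ is a polynomial degree. The regression function, noise level, split size `m`, and candidate degrees below are assumptions chosen only for illustration.

```matlab
% Sketch: hold-out selection of a polynomial degree (lambda = d).
% Assumptions (not from the text): the regression function, noise level,
% split size m, and candidate degrees 0..6 are illustrative choices.
rng(2);
n = 60;  m = 40;                         % m training points, n-m held out
sig = 0.1;
x = rand(n,1);
y = exp(-5*(x-.3).^2) + sig*randn(n,1);
xT = x(1:m);      yT = y(1:m);           % training set D_T
xV = x(m+1:n);    yV = y(m+1:n);         % held-out set D_V

degrees = 0:6;
RV = zeros(size(degrees));               % hold-out error estimates
for j = 1:numel(degrees)
    d = degrees(j);
    p = bsxfun(@power, xT, 0:d) \ yT;    % candidate fit using D_T only
    RV(j) = mean((bsxfun(@power, xV, 0:d)*p - yV).^2);
end
[RVmin, jbest] = min(RV);                % model with smallest hold-out error
fprintf('selected degree = %d (hold-out error %.4f)\n', ...
        degrees(jbest), RVmin);
```

Each candidate is fit without ever seeing the held-out points, so `RV(j)` plays the role of the hold-out error estimate for the $j$-th candidate and `degrees(jbest)` the role of the selected index.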

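4.2.3.1 Leaving-one-out Cross-Validation

A very popular hold-out method is the so-called "leaving-one-out cross-validation" studied in depth by Grace Wahba (UW-Madison, Statistics). For each $\lambda$ we compute

$$\hat{f}_\lambda^{(k)} = \arg\min_{f \in \mathcal{F}} \frac{1}{n-1} \sum_{i \neq k} \ell(f(X_i), Y_i) + \lambda C(f) \tag{4.16}$$

or

$$\hat{f}_\lambda^{(k)} = \arg\min_{f \in \mathcal{F}_\lambda} \frac{1}{n-1} \sum_{i \neq k} \ell(f(X_i), Y_i), \tag{4.17}$$

where the superscript $(k)$ indicates that the $k$-th sample has been left out of the fit. Then we have the cross-validation function

$$V(\lambda) = \frac{1}{n} \sum_{k=1}^{n} \ell\left(\hat{f}_\lambda^{(k)}(X_k), Y_k\right), \qquad \lambda^* = \arg\min_{\lambda} V(\lambda). \tag{4.18}$$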
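A minimal sketch of leaving-one-out cross-validation over polynomial degrees is given below. As before, the data-generating function, the noise level, `n`, and the range of degrees are illustrative assumptions rather than anything specified in the notes.

```matlab
% Sketch: leave-one-out cross-validation over polynomial degree.
% Assumptions (not from the text): the regression function, noise level,
% n = 25, and candidate degrees 0..5 are illustrative choices.
rng(3);
n = 25;  sig = 0.1;
x = rand(n,1);
y = exp(-5*(x-.3).^2) + sig*randn(n,1);

degrees = 0:5;
Vcv = zeros(size(degrees));              % cross-validation function V(lambda)
for j = 1:numel(degrees)
    d = degrees(j);
    err = zeros(n,1);
    for k = 1:n
        keep = true(n,1);  keep(k) = false;           % leave sample k out
        p = bsxfun(@power, x(keep), 0:d) \ y(keep);   % fit on the other n-1
        err(k) = (bsxfun(@power, x(k), 0:d)*p - y(k))^2;
    end
    Vcv(j) = mean(err);                  % average loss on the left-out points
end
[~, jstar] = min(Vcv);
fprintf('degree chosen by leave-one-out CV = %d\n', degrees(jstar));
```

The inner loop refits the model $n$ times per candidate, which is affordable here because each fit is a small least-squares problem.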
4.3 Summary

To summarize, this lecture gave a brief and incomplete survey of different methods for dealing with the issues of overfitting and model selection. Given a set of training data, $D_n = \{X_i, Y_i\}_{i=1}^{n}$, our overall goal is to find

$$f^* = \arg\min_{f \in \mathcal{F}} R(f) \tag{4.19}$$

from some collection of functions, $\mathcal{F}$. Because we do not know the true distribution $P_{XY}$ underlying the data points $D_n$, it is difficult to get an exact handle on the risk, $R(f)$. If we only focus on minimizing the empirical risk $\hat{R}_n(f)$ we end up overfitting to the training data. Two general approaches were presented.

1. In the first approach we consider an indexed collection of spaces $\{\mathcal{F}_\lambda\}_{\lambda \in \Lambda}$ such that the complexity of $\mathcal{F}_\lambda$ increases as $\lambda$ increases, and

$$\lim_{\lambda \to \infty} \mathcal{F}_\lambda = \mathcal{F}. \tag{4.20}$$

A solution is given by

$$\hat{f}_{\lambda^*} = \arg\min_{f \in \mathcal{F}_{\lambda^*}} \hat{R}_n(f) \tag{4.21}$$

where either $\lambda^*$ is a function which increases with $n$,

$$\lambda^* = \lambda(n), \tag{4.22}$$

or $\lambda^*$ is chosen by hold-out validation.

2. The alternative approach is to incorporate a penalty term into the risk minimization problem formulation. Here we consider an indexed collection of penalties $\{C_\lambda\}_{\lambda \in \Lambda}$ satisfying the following properties:

a. $C_\lambda : \mathcal{F} \to \mathbb{R}^+$;

b. For each $f \in \mathcal{F}$ and $\lambda_1 < \lambda_2$ we have $C_{\lambda_1}(f) \le C_{\lambda_2}(f)$;

c. There exists $\lambda_0 \in \Lambda$ such that $C_{\lambda_0}(f) = 0$ for all $f \in \mathcal{F}$.

In this formulation we find a solution

$$\hat{f}_{\lambda^*} = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + C_{\lambda^*}(f) \right\}, \tag{4.23}$$

where either $\lambda^* = \lambda(n)$, a function growing with the number of data samples $n$, or $\lambda^*$ is selected by hold-out validation.

4.4 Consistency 

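If an estimator or classifier $\hat{f}_{\lambda^*}$ satisfies

$$E[R(\hat{f}_{\lambda^*})] \to \inf_{f \in \mathcal{F}} R(f) \quad \text{as } n \to \infty, \tag{4.24}$$

then we say that $\hat{f}_{\lambda^*}$ is $\mathcal{F}$-consistent with respect to the risk $R$. When the context is clear, we will simply say that $\hat{f}$ is consistent.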



Chapter 5 

An Example of the Use of Sieves for 
Complexity Regularization in Denoising 1 

Consider the following setting. Let

$$Y = f^*(X) + W, \tag{5.1}$$

where $X$ is a random variable (r.v.) on $\mathcal{X} = [0, 1]$, $W$ is a r.v. on $\mathcal{Y} = \mathbb{R}$, independent of $X$ and satisfying

$$E[W] = 0 \quad \text{and} \quad E[W^2] = \sigma^2 < \infty. \tag{5.2}$$

Finally let $f^* : [0, 1] \to \mathbb{R}$ be a function satisfying

$$|f^*(t) - f^*(s)| \le L|t - s|, \quad \forall t, s \in [0, 1], \tag{5.3}$$

where $L > 0$ is a constant. A function satisfying condition (5.3) is said to be Lipschitz on $[0, 1]$. Notice that such a function must be continuous, but it is not necessarily differentiable. An example of such a function is depicted in Figure 5.1(a).



1 This content is available online at <http://cnx.org/content/m16261/1.3/>.

[Figure 5.1, panels (a) and (b).]

Figure 5.1: Example of a Lipschitz function, and our observations setting. (a) Random sampling of $f^*$, the points correspond to $(X_i, Y_i)$, $i = 1, \dots, n$; (b) deterministic sampling of $f^*$, the points correspond to $(i/n, Y_i)$, $i = 1, \dots, n$.



Note that

$$\begin{aligned} E[Y | X = x] &= E[f^*(X) + W | X = x] \\ &= E[f^*(x) + W | X = x] \\ &= f^*(x) + E[W] = f^*(x). \end{aligned} \tag{5.4}$$

Consider our usual setup: Estimate $f^*$ using $n$ training examples

$$\{X_i, Y_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P_{XY}, \qquad Y_i = f^*(X_i) + W_i, \quad i \in \{1, \dots, n\}, \tag{5.5}$$

where $\overset{\text{i.i.d.}}{\sim}$ means independently and identically distributed. Figure 5.1(a) illustrates this setup.

In many applications we can sample $\mathcal{X} = [0, 1]$ as we like, and not necessarily at random. For example we can take $n$ samples uniformly on $[0, 1]$:

$$x_i = \frac{i}{n}, \quad i = 1, \dots, n, \qquad Y_i = f^*(x_i) + W_i. \tag{5.6}$$

We will proceed with this setup (as in Figure 5.1(b)) in the rest of the lecture.

Our goal is to find $\hat{f}_n$ such that $E[\|f^* - \hat{f}_n\|^2] \to 0$ as $n \to \infty$ (here $\|\cdot\|$ is the usual $L_2$-norm; i.e., $\|f^* - \hat{f}_n\|^2 = \int_0^1 |f^*(t) - \hat{f}_n(t)|^2 \, dt$).

Let

$$\mathcal{F} = \{f : f \text{ is Lipschitz with constant } L\}. \tag{5.7}$$



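The Risk is defined as

$$R(f) = \|f^* - f\|^2 = \int_0^1 |f^*(t) - f(t)|^2 \, dt. \tag{5.8}$$

The Expected Risk (recall that our estimator $\hat{f}_n$ is based on $\{x_i, Y_i\}$ and hence is a r.v.) is defined as

$$E[R(\hat{f}_n)] = E[\|f^* - \hat{f}_n\|^2]. \tag{5.9}$$

Finally the Empirical Risk is defined as

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \left( f\left(\tfrac{i}{n}\right) - Y_i \right)^2. \tag{5.10}$$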



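Let $0 < m_1 \le m_2 \le m_3 \le \cdots$ be a sequence of integers satisfying $m_n \to \infty$ as $n \to \infty$, and $k_n m_n = n$ for some integer $k_n > 0$. That is, for each value of $n$ there is an associated integer value $m_n$. Define the Sieve $\mathcal{F}_1, \mathcal{F}_2, \mathcal{F}_3, \dots$,

$$\mathcal{F}_n = \left\{ f : f(t) = \sum_{j=1}^{m_n} c_j \mathbf{1}_{\left\{ \frac{j-1}{m_n} \le t < \frac{j}{m_n} \right\}}, \ c_j \in \mathbb{R} \right\}. \tag{5.11}$$

$\mathcal{F}_n$ is the space of functions that are constant on intervals

$$I_{j, m_n} = \left[ \frac{j-1}{m_n}, \frac{j}{m_n} \right), \quad j = 1, \dots, m_n. \tag{5.12}$$

From here on we will use $m$ and $k$ instead of $m_n$ and $k_n$ (dropping the subscript $n$) for notational ease. Define

$$\bar{f}_n(t) = \sum_{j=1}^{m} c_j^* \mathbf{1}_{\{t \in I_{j,m}\}}, \quad \text{where} \quad c_j^* = \frac{1}{k} \sum_{i : \frac{i}{n} \in I_{j,m}} f^*\left(\tfrac{i}{n}\right). \tag{5.13}$$

Note that $\bar{f}_n \in \mathcal{F}_n$.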
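Example 5.1: Exercise 1

Upper bound $\|f^* - \bar{f}_n\|^2$.

$$\begin{aligned} \|f^* - \bar{f}_n\|^2 &= \int_0^1 |f^*(t) - \bar{f}_n(t)|^2 \, dt \\ &= \sum_{j=1}^{m} \int_{I_{j,m}} \left| \frac{1}{k} \sum_{i : \frac{i}{n} \in I_{j,m}} \left( f^*(t) - f^*\left(\tfrac{i}{n}\right) \right) \right|^2 dt \\ &\le \sum_{j=1}^{m} \int_{I_{j,m}} \left( \frac{1}{k} \sum_{i : \frac{i}{n} \in I_{j,m}} \left| f^*(t) - f^*\left(\tfrac{i}{n}\right) \right| \right)^2 dt \\ &\le \sum_{j=1}^{m} \int_{I_{j,m}} \left( \frac{L}{m} \right)^2 dt \\ &= \sum_{j=1}^{m} \frac{1}{m} \left( \frac{L}{m} \right)^2 = \left( \frac{L}{m} \right)^2. \end{aligned} \tag{5.14}$$

The above implies that $\|f^* - \bar{f}_n\|^2 \to 0$ as $n \to \infty$, since $m = m_n \to \infty$ as $n \to \infty$. In words, with $n$ sufficiently large we can approximate $f^*$ to arbitrary accuracy using models in $\mathcal{F}_n$ (even if the functions we are using to approximate $f^*$ are not Lipschitz!).

For any $f \in \mathcal{F}_n$, $f = \sum_{j=1}^{m} c_j \mathbf{1}_{\{t \in I_{j,m}\}}$, we have

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \left( f\left(\tfrac{i}{n}\right) - Y_i \right)^2 = \frac{1}{n} \sum_{j=1}^{m} \sum_{i : \frac{i}{n} \in I_{j,m}} (c_j - Y_i)^2. \tag{5.15}$$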



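Let $\hat{f}_n = \arg\min_{f \in \mathcal{F}_n} \hat{R}_n(f)$. Then

$$\hat{f}_n(t) = \sum_{j=1}^{m} \hat{c}_j \mathbf{1}_{\{t \in I_{j,m}\}}, \quad \text{where} \quad \hat{c}_j = \frac{1}{k} \sum_{i : \frac{i}{n} \in I_{j,m}} Y_i. \tag{5.16}$$

Example 5.2: Exercise 2

Show (5.16).

Note that $E[\hat{c}_j] = c_j^*$ and therefore $E[\hat{f}_n(t)] = \bar{f}_n(t)$. Let's now analyze the expected risk of $\hat{f}_n$:

$$\begin{aligned} E[\|f^* - \hat{f}_n\|^2] &= E[\|f^* - \bar{f}_n + \bar{f}_n - \hat{f}_n\|^2] \\ &= \|f^* - \bar{f}_n\|^2 + E[\|\bar{f}_n - \hat{f}_n\|^2] + 2 E[\langle f^* - \bar{f}_n, \bar{f}_n - \hat{f}_n \rangle] \\ &= \|f^* - \bar{f}_n\|^2 + E[\|\bar{f}_n - \hat{f}_n\|^2] + 2 \langle f^* - \bar{f}_n, E[\bar{f}_n - \hat{f}_n] \rangle \\ &= \|f^* - \bar{f}_n\|^2 + E[\|\bar{f}_n - \hat{f}_n\|^2], \end{aligned} \tag{5.17}$$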
where the final step follows from the fact that $E[\hat{f}_n(t)] = \bar{f}_n(t)$. A couple of important remarks pertaining to the right-hand side of equation (5.17): The first term, $\|f^* - \bar{f}_n\|^2$, corresponds to the approximation error, and indicates how well we can approximate the function $f^*$ with a function from $\mathcal{F}_n$. Clearly, the larger the class $\mathcal{F}_n$ is, the smaller we can make this term. This term is precisely the squared bias of the estimator $\hat{f}_n$. The second term, $E[\|\bar{f}_n - \hat{f}_n\|^2]$, is the estimation error, the variance of our estimator. We will see that the estimation error is small if the class of possible estimators $\mathcal{F}_n$ is also small.

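The behavior of the first term in (5.17) was already studied. Consider the other term:

$$\begin{aligned} E[\|\bar{f}_n - \hat{f}_n\|^2] &= E\left[ \int_0^1 |\bar{f}_n(t) - \hat{f}_n(t)|^2 \, dt \right] \\ &= E\left[ \sum_{j=1}^{m} \int_{I_{j,m}} |c_j^* - \hat{c}_j|^2 \, dt \right] \\ &= \sum_{j=1}^{m} \int_{I_{j,m}} E\left[ \left| \frac{1}{k} \sum_{i : \frac{i}{n} \in I_{j,m}} W_i \right|^2 \right] dt \\ &\le \sum_{j=1}^{m} \int_{I_{j,m}} \frac{\sigma^2}{k} \, dt \\ &= \sum_{j=1}^{m} \frac{1}{m} \frac{\sigma^2}{k} = \frac{\sigma^2}{k} = \frac{m}{n} \sigma^2. \end{aligned} \tag{5.18}$$

Combining all the facts derived we have

$$E[\|f^* - \hat{f}_n\|^2] \le \frac{L^2}{m^2} + \frac{m}{n} \sigma^2 = O\left( \max\left\{ \frac{1}{m^2}, \frac{m}{n} \right\} \right). \tag{5.19}$$

This equation uses Big-O notation.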

What is the best choice of $m$? If $m$ is small then the approximation error (i.e., $O(1/m^2)$) is going to be large, but the estimation error (i.e., $O(m/n)$) is going to be small, and vice-versa. These two conflicting goals provide a tradeoff that directs our choice of $m$ (as a function of $n$). In Figure 5.2 we depict this tradeoff. In Figure 5.2(a) we considered a large $m_n$ value, and we see that the approximation of $f^*$ by a function in the class $\mathcal{F}_n$ can be very accurate (that is, our estimate will have a small bias), but when we use the measured data our estimate looks very bad (high variance). On the other hand, as illustrated in Figure 5.2(b), using a very small $m_n$ allows our estimator to get very close to the best approximating function in the class $\mathcal{F}_n$, so we have a low variance estimator, but the bias of our estimator (i.e., the difference between $\bar{f}_n$ and $f^*$) is quite considerable.



[Figure 5.2, panels (a) and (b).]

Figure 5.2: Approximation and estimation of $f^*$ (in blue) for $n = 60$. The function $\bar{f}_n$ is depicted in green and the function $\hat{f}_n$ is depicted in red. In (a) we have $m = 60$ and in (b) we have $m = 6$.



We need to balance the two terms in the right-hand side of (5.19) in order to maximize the rate of decay (with $n$) of the expected risk. This implies that $\frac{1}{m^2} = \frac{m}{n}$, therefore $m_n = n^{1/3}$ and the Mean Squared Error (MSE) is

$$E[\|f^* - \hat{f}_n\|^2] = O\left( n^{-2/3} \right). \tag{5.20}$$
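The balancing step can be made explicit with a short calculation. This is a sketch that treats $m$ as a continuous variable (an idealization, since $m$ must be an integer dividing $n$) and keeps the constants $L$ and $\sigma^2$ from the bound $L^2/m^2 + \sigma^2 m/n$:

$$g(m) = \frac{L^2}{m^2} + \frac{\sigma^2 m}{n}, \qquad g'(m) = -\frac{2L^2}{m^3} + \frac{\sigma^2}{n} = 0 \ \Longrightarrow \ m^* = \left( \frac{2 L^2 n}{\sigma^2} \right)^{1/3},$$

$$g(m^*) = \frac{\sigma^2 m^*}{2n} + \frac{\sigma^2 m^*}{n} = \frac{3}{2} \left( 2 L^2 \sigma^4 \right)^{1/3} n^{-2/3}.$$

So the exact minimizer differs from $n^{1/3}$ only by a constant factor, and the resulting bound decays at the same $n^{-2/3}$ rate.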



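So the sieve $\mathcal{F}_1, \mathcal{F}_2, \dots$ with

$$\mathcal{F}_n = \left\{ f : f(t) = \sum_{j=1}^{m_n} c_j \mathbf{1}_{\left\{ \frac{j-1}{m_n} \le t < \frac{j}{m_n} \right\}}, \ c_j \in \mathbb{R} \right\}, \tag{5.21}$$

produces an $\mathcal{F}$-consistent estimator for $f^* = E[Y | X = x] \in \mathcal{F}$.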

It is interesting to note that the rate of decay of the MSE we obtain with this strategy cannot be further improved by using more sophisticated estimation techniques (that is, $n^{-2/3}$ is the minimax MSE rate for this problem). Also, rather surprisingly, we are considering classes of models $\mathcal{F}_n$ that are actually not Lipschitz, therefore our estimator of $f^*$ is not a Lipschitz function, unlike $f^*$ itself.
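The sieve estimator of this chapter can be sketched in a few lines. The specific Lipschitz function `f_star`, the noise level `sigma`, and $n = 1000$ (so that $m = 10$ and $k = 100$) are assumptions made for illustration. The sketch also assigns the design point $i/n$ to bin $\lceil i/k \rceil$, i.e., it uses intervals closed on the right, so that every bin receives exactly $k$ samples; this is a convenience choice and not the interval convention written in (5.12).

```matlab
% Sketch: piecewise-constant sieve estimator on [0,1] with m = n^(1/3) bins.
% Assumptions (not from the text): f_star, sigma, n = 1000, and right-closed
% bins ((j-1)/m, j/m] so that each bin holds exactly k design points.
rng(4);
n = 1000;
m = round(n^(1/3));                  % number of bins (10 here)
k = n/m;                             % samples per bin (100 here)
sigma  = 0.2;
f_star = @(t) abs(t - 0.5);          % Lipschitz with constant L = 1
x = (1:n)'/n;                        % deterministic design x_i = i/n
Y = f_star(x) + sigma*randn(n,1);

bin    = ceil((1:n)'/k);             % bin index of each design point
c_hat  = accumarray(bin, Y) / k;           % bin averages of the data
c_star = accumarray(bin, f_star(x)) / k;   % bin averages of f_star

% Evaluate the piecewise-constant functions on a fine grid of midpoints.
t   = ((1:10000)' - 0.5) / 10000;
idx = min(max(ceil(t*m), 1), m);     % bin index of each grid point
approx_err = mean((f_star(t) - c_star(idx)).^2);   % approx. ||f* - fbar||^2
total_err  = mean((f_star(t) - c_hat(idx)).^2);    % approx. ||f* - fhat||^2

fprintf('approximation error = %.5f (bound (L/m)^2 = %.5f)\n', ...
        approx_err, (1/m)^2);
fprintf('total squared error = %.5f\n', total_err);
```

The approximation error respects the $(L/m)^2$ bound from Exercise 1 by construction, while the total error also includes the noise-driven term of order $\sigma^2 m / n$.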



Chapter 6 

Plug-In Classifier and Histogram 
Classifier 1 



We return to the topic of classification, and we assume an input (feature) space $\mathcal{X}$ and a binary output (label) space $\mathcal{Y} = \{0, 1\}$. Recall that the Bayes classifier (which minimizes the probability of misclassification) is defined by

$$f^*(x) = \begin{cases} 1, & P(Y = 1 | X = x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases} \tag{6.1}$$

Throughout this section, we will denote the conditional probability function by

$$\eta(x) = P(Y = 1 | X = x). \tag{6.2}$$



6.1 Plug-in Classifiers 

One way to construct a classifier using the training data $\{X_i, Y_i\}_{i=1}^{n}$ is to estimate $\eta(x)$ and then plug it into the form of the Bayes classifier. That is, obtain an estimate,

$$\hat{\eta}_n(x) = \hat{\eta}\left(x; \{X_i, Y_i\}_{i=1}^{n}\right) \tag{6.3}$$

and then form the "plug-in" classification rule

$$\hat{f}(x) = \begin{cases} 1, & \hat{\eta}(x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases} \tag{6.4}$$

Remark: The function $\eta(x)$ is generally more complicated than the ultimate classification rule (binary-valued), as we can see

$$\begin{aligned} \eta &: \mathcal{X} \to [0, 1] \\ f &: \mathcal{X} \to \{0, 1\}. \end{aligned} \tag{6.5}$$

1 This content is available online at <http://cnx.org/content/m16280/1.2/>.
Therefore, in this sense plug-in methods are solving a more complicated problem than necessary. However, plug-in methods can perform well, as demonstrated by the next result.

Theorem 6.1: Plug-in Classifier
Let $\tilde{\eta}$ be an approximation to $\eta$, and consider the plug-in rule

$$f(x) = \begin{cases} 1, & \tilde{\eta}(x) \ge 1/2 \\ 0, & \text{otherwise.} \end{cases} \tag{6.6}$$

Then,

$$R(f) - R^* \le 2 E[|\eta(X) - \tilde{\eta}(X)|] \tag{6.7}$$

where

$$R(f) = P(f(X) \neq Y), \qquad R^* = R(f^*) = \inf_{f} R(f). \tag{6.8}$$

Proof:

Consider any $x \in \mathbb{R}^d$. In proving the optimality of the Bayes classifier $f^*$ in Lecture 2 (Chapter 3), we showed that

$$P(f(x) \neq Y | X = x) - P(f^*(x) \neq Y | X = x) = (2\eta(x) - 1) \left( \mathbf{1}_{\{f^*(x) = 1\}} - \mathbf{1}_{\{f(x) = 1\}} \right), \tag{6.9}$$

which is equivalent to

$$P(f(x) \neq Y | X = x) - P(f^*(x) \neq Y | X = x) = |2\eta(x) - 1| \, \mathbf{1}_{\{f^*(x) \neq f(x)\}}, \tag{6.10}$$

since $f^*(x) = 1$ whenever $2\eta(x) - 1 \ge 0$. Thus,

$$\begin{aligned} P(f(X) \neq Y) - R^* &= \int_{\mathbb{R}^d} 2 |\eta(x) - 1/2| \, \mathbf{1}_{\{f^*(x) \neq f(x)\}} \, p_X(x) \, dx \\ &\le \int_{\mathbb{R}^d} 2 |\eta(x) - \tilde{\eta}(x)| \, \mathbf{1}_{\{f^*(x) \neq f(x)\}} \, p_X(x) \, dx \\ &\le \int_{\mathbb{R}^d} 2 |\eta(x) - \tilde{\eta}(x)| \, p_X(x) \, dx \\ &= 2 E[|\eta(X) - \tilde{\eta}(X)|], \end{aligned} \tag{6.11}$$

where $p_X(x)$ is the marginal density of $X$, the first inequality follows from the fact

$$f(x) \neq f^*(x) \ \Longrightarrow \ |\eta(x) - \tilde{\eta}(x)| \ge |\eta(x) - 1/2|, \tag{6.12}$$

and the second inequality is simply a result of the fact that $\mathbf{1}_{\{f^*(x) \neq f(x)\}}$ is either 0 or 1.
Figure 6.1: Pictorial illustration of $|\eta(x) - \tilde{\eta}(x)| \ge |\eta(x) - 1/2|$ when $f(x) \neq f^*(x)$. Note that the inequality $P(f(X) \neq Y) - R^* \le \int_{\mathbb{R}^d} 2 |\eta(x) - \tilde{\eta}(x)| \, \mathbf{1}_{\{f^*(x) \neq f(x)\}} \, p_X(x) \, dx$ shows that the excess risk is at most twice the integral over the set where $f^*(x) \neq f(x)$. The difference $|\eta(x) - \tilde{\eta}(x)|$ may be arbitrarily large away from this set without affecting the error rate of the classifier. This illustrates the fact that estimating $\eta$ well everywhere (i.e., regression) is unnecessary for the design of a good classifier (we only need to determine where $\eta$ crosses the 1/2-level). In other words, "classification is easier than regression."

The theorem shows us that a good estimate of $\eta$ can produce a good plug-in classification rule. By "good" estimate, we mean an estimator $\tilde{\eta}$ that is close to $\eta$ in expected $L_1$-norm.



6.2 The Histogram Classifier 



Let's assume that the (input) features are randomly distributed over the unit hypercube $\mathcal{X} = [0, 1]^d$ (note that by scaling and shifting any set of bounded features we can satisfy this assumption), and assume that the (output) labels are binary, i.e., $\mathcal{Y} = \{0, 1\}$. A histogram classifier is based on a partition of the hypercube $[0, 1]^d$ into $M$ smaller cubes of equal size.

Example 6.1: Partition of hypercube in 2 dimensions

Consider the unit square $[0, 1]^2$ and partition it into $M$ subsquares of equal area (assuming $M$ is a squared integer). Let the subsquares be denoted by $\{Q_i\}$, $i = 1, \dots, M$.
[Diagram: the unit square divided into $M$ subsquares, each of side length $M^{-1/2}$.]

Figure 6.2: Example of hypercube $[0, 1]^2$ in $M$ equally sized partition



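Define the following piecewise-constant estimator of $\eta(x)$:

$$\hat{\eta}_n(x) = \sum_{j=1}^{M} \hat{P}_j \mathbf{1}_{\{x \in Q_j\}} \tag{6.13}$$

where

$$\hat{P}_j = \frac{\sum_{i=1}^{n} \mathbf{1}_{\{X_i \in Q_j, Y_i = 1\}}}{\sum_{i=1}^{n} \mathbf{1}_{\{X_i \in Q_j\}}}. \tag{6.14}$$

Like our previous denoising examples, we expect that the bias of $\hat{\eta}_n$ will decrease as $M$ increases, but the variance will increase as $M$ increases.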
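The estimator above and the resulting plug-in rule can be sketched as follows for the unit square. The conditional probability `eta`, the sample sizes, the number of cells $M = 16$, and the convention of assigning the estimate 0 to empty cells are assumptions made for this illustration.

```matlab
% Sketch: histogram classifier on the unit square with M = side^2 cells.
% Assumptions (not from the text): eta, n, side = 4, the test-set size, and
% the convention that empty cells receive the estimate 0.
rng(5);
n = 2000;  side = 4;  M = side^2;
eta = @(x) 0.9*(x(:,1) + x(:,2) > 1) + 0.05;   % P(Y = 1 | X = x)
X = rand(n,2);
Y = double(rand(n,1) < eta(X));                % labels drawn from eta

% Map each point to the index (1..M) of the subsquare containing it.
cell_of = @(x) min(floor(x(:,1)*side), side-1)*side ...
             + min(floor(x(:,2)*side), side-1) + 1;
c = cell_of(X);
N = accumarray(c, 1, [M 1]);         % number of samples in each cell
B = accumarray(c, Y, [M 1]);         % number of samples labelled 1 per cell
P_hat = B ./ max(N, 1);              % cell-wise estimate of eta
f_hat = @(x) double(P_hat(cell_of(x)) >= 1/2);   % plug-in rule

Xt = rand(5000,2);                   % independent test sample
Yt = double(rand(5000,1) < eta(Xt));
err = mean(f_hat(Xt) ~= Yt);         % empirical misclassification rate
fprintf('test error of the histogram classifier = %.3f\n', err);
```

Rerunning with a larger `side` refines the decision boundary but leaves fewer samples per cell, which is the bias-variance tension described above.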

Theorem 6.2: Consistency of Histogram Classifiers

If $M \to \infty$ and $\frac{n}{M} \to \infty$ as $n \to \infty$, then the histogram classifier risk converges to the Bayes risk for every distribution $P_{XY}$ with marginal density $p_X(x) \ge c$, for some constant $c > 0$ (see footnote 2).

What the theorem tells us is that we need the number of partition cells to tend to infinity (to ensure that the bias tends to zero), but it can't grow faster than the number of samples (i.e., we want the number of samples per cell to tend to infinity to drive the variance to zero).
Proof:

Let $P_j = \dfrac{\int_{Q_j} \eta(x) p_X(x) \, dx}{\int_{Q_j} p_X(x) \, dx}$ (the theoretical analog of $\hat{P}_j$) and define

$$\bar{\eta}(x) = \sum_{j=1}^{M} P_j \mathbf{1}_{\{x \in Q_j\}}. \tag{6.15}$$

The function $\bar{\eta}$ is the theoretical analog of $\hat{\eta}_n$ (i.e., the function obtained by averaging $\eta$ over the partition cells). By the triangle inequality,

$$E[|\hat{\eta}_n(X) - \eta(X)|] \le \underbrace{E[|\hat{\eta}_n(X) - \bar{\eta}(X)|]}_{\text{estimation error}} + \underbrace{E[|\bar{\eta}(X) - \eta(X)|]}_{\text{approximation error}}. \tag{6.16}$$

2 Actually, the result holds for every distribution $P_{XY}$. For the more general theorem, refer to Theorem 6.1 in A Probabilistic Theory of Pattern Recognition by Luc Devroye, Laszlo Gyorfi and Gabor Lugosi.



Let's first bound the estimation error. For any $x \in [0, 1]^d$, let $Q(x)$ denote the histogram bin in which $x$ falls. Define the random variable

$$N(x) = \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in Q(x)\}}. \tag{6.17}$$

If $Q(x) = Q_j$, then this random variable is simply the number of training samples falling in $Q_j$. Note that

$$\hat{\eta}_n(x) = \frac{B(x)}{N(x)} \tag{6.18}$$

where $B(x) = \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in Q(x), Y_i = 1\}} = \sum_{i : X_i \in Q(x)} Y_i$. $B(x)$ is simply the number of samples in cell $Q(x)$ labelled 1. Now $\hat{\eta}_n(x)$ is a fairly complicated random variable, but the conditional distribution of $B(x)$ given $N(x)$ is relatively simple. Note that

$$B(x) \mid N(x) = k \ \sim \ \text{Binomial}(k, \bar{\eta}(x)) \tag{6.19}$$

since $\bar{\eta}(x)$ is the probability of a sample in $Q(x)$ having the label 1 and we are conditioning on the event of observing $k$ samples in $Q(x)$.
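Now consider the conditional expectation

$$E[|\hat{\eta}_n(x) - \bar{\eta}(x)| \mid N(x) = k] \le \begin{cases} E\left[ \left| \frac{B(x)}{N(x)} - \bar{\eta}(x) \right| \,\Big|\, N(x) = k \right], & k > 0 \\ 1, & k = 0 \ (\text{since } 0 \le \bar{\eta}(x) \le 1). \end{cases} \tag{6.20}$$

Next note that

$$\begin{aligned} E\left[ \left| \frac{B(x)}{N(x)} - \bar{\eta}(x) \right| \,\Big|\, N(x) = k \right] &= E\left[ \left| \frac{B(x)}{k} - \bar{\eta}(x) \right| \,\Big|\, N(x) = k \right] \\ &= \frac{1}{k} E\left[ |B(x) - k\bar{\eta}(x)| \mid N(x) = k \right] \\ &\le \frac{1}{k} \Big( \underbrace{E\left[ |B(x) - k\bar{\eta}(x)|^2 \mid N(x) = k \right]}_{\text{conditional variance of } B(x)} \Big)^{1/2} \end{aligned} \tag{6.21}$$

by Jensen's inequality, $E[|Z|] \le \left( E[|Z|^2] \right)^{1/2}$, where $k\bar{\eta}(x)$ is the conditional mean $E[B(x) \mid N(x) = k]$. Therefore,

$$E\left[ \left| \frac{B(x)}{N(x)} - \bar{\eta}(x) \right| \,\Big|\, N(x) = k \right] \le \frac{1}{k} \left( k \bar{\eta}(x) (1 - \bar{\eta}(x)) \right)^{1/2} = \sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{k}} \tag{6.22}$$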
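and

$$E[|\hat{\eta}_n(x) - \bar{\eta}(x)| \mid N(x) = k] \le \begin{cases} \sqrt{\dfrac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{k}}, & k > 0 \\ 1, & k = 0 \end{cases} \tag{6.23}$$

or in other words,

$$E[|\hat{\eta}_n(x) - \bar{\eta}(x)| \mid N(x)] \le \sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{N(x)}} \, \mathbf{1}_{\{N(x) > 0\}} + \mathbf{1}_{\{N(x) = 0\}}. \tag{6.24}$$

Now taking expectation with respect to $N(x)$, and using $\bar{\eta}(x)(1 - \bar{\eta}(x)) \le 1/4$,

$$\begin{aligned} E[|\hat{\eta}_n(x) - \bar{\eta}(x)|] &= E_N\left[ E[|\hat{\eta}_n(x) - \bar{\eta}(x)| \mid N(x)] \right] \\ &\le E\left[ \sqrt{\frac{\bar{\eta}(x)(1 - \bar{\eta}(x))}{N(x)}} \, \mathbf{1}_{\{N(x) > 0\}} \right] + P(N(x) = 0) \\ &\le \frac{1}{2} P(0 < N(x) < k) + \frac{1}{2\sqrt{k}} P(N(x) \ge k) + P(N(x) = 0) \\ &\le \frac{1}{2} P(N(x) < k) + \frac{1}{2\sqrt{k}} + P(N(x) = 0). \end{aligned} \tag{6.25}$$

Now a key fact is that for any $k > 0$, $P(N(x) < k) \to 0$ as $n \to \infty$. This follows from the assumption that the marginal density $p_X(x) \ge c$, for some constant $c > 0$, and $\frac{n}{M} \to \infty$ as $n \to \infty$. This result is easily verified by contradiction. If $P(N(x) < k) \to q > 0$ as $n \to \infty$, then $p_X(x) \ge c > 0$ is contradicted. Thus, for any $\epsilon > 0$ there exists a $k > 0$ such that $\frac{1}{\sqrt{k}} < \epsilon$ and $P(N(x) < k) < \epsilon$ for $n$ sufficiently large. Therefore, for $n$ sufficiently large and every $x \in [0, 1]^d$,

$$E[|\hat{\eta}_n(x) - \bar{\eta}(x)|] \le 3\epsilon \tag{6.26}$$

where the expectation is with respect to the distribution of the sample $\{X_i, Y_i\}_{i=1}^{n}$. Thus,

$$E[|\hat{\eta}_n(X) - \bar{\eta}(X)|] \le 3\epsilon \tag{6.27}$$

where the expectation is now with respect to the distribution of the sample and the marginal distribution of $X$.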

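Next consider the approximation error $E[|\bar{\eta}(X) - \eta(X)|]$, where the expectation is over $X$ alone. The function $\eta$ may not itself be continuous, but there is another function $\eta_\epsilon$ that is uniformly continuous and such that $E[|\eta_\epsilon(X) - \eta(X)|] < \epsilon$. Recall that uniformly continuous functions can be well approximated by piecewise constant functions.

By the triangle inequality,

$$E[|\bar{\eta} - \eta|] \le \underbrace{E[|\bar{\eta} - \bar{\eta}_\epsilon|]}_{\le E[|\eta - \eta_\epsilon|] < \epsilon} + E[|\bar{\eta}_\epsilon - \eta_\epsilon|] + \underbrace{E[|\eta_\epsilon - \eta|]}_{< \epsilon \text{ by design}} \tag{6.28}$$

where $\bar{\eta}_\epsilon(x) = \sum_{j=1}^{M} \left[ \dfrac{\int_{Q_j} \eta_\epsilon(x) p_X(x) \, dx}{P(X \in Q_j)} \right] \mathbf{1}_{\{x \in Q_j\}}$. For the first term,

$$E[|\bar{\eta}(X) - \bar{\eta}_\epsilon(X)|] = \sum_{j=1}^{M} \left| \int_{Q_j} (\eta(x) - \eta_\epsilon(x)) p_X(x) \, dx \right| \le \sum_{j=1}^{M} \int_{Q_j} |\eta(x) - \eta_\epsilon(x)| \, p_X(x) \, dx = E[|\eta(X) - \eta_\epsilon(X)|] < \epsilon, \tag{6.29}$$

and since $\eta_\epsilon$ is uniformly continuous,

$$E[|\bar{\eta}_\epsilon(X) - \eta_\epsilon(X)|] = \sum_{j=1}^{M} \int_{Q_j} |\bar{\eta}_\epsilon(x) - \eta_\epsilon(x)| \, p_X(x) \, dx \le \sum_{j=1}^{M} \delta_M P(X \in Q_j) = \delta_M, \tag{6.30}$$

where $\delta_M$ depends on $M$, since $\sum_{j=1}^{M} P(X \in Q_j) = 1$. By taking $M$ sufficiently large, $\delta_M$ can be made arbitrarily small. So for large $M$, $\delta_M < \epsilon$. Thus, we have shown

$$E[|\bar{\eta}(X) - \eta(X)|] \le 3\epsilon \tag{6.31}$$

for sufficiently large $M$. Since $\epsilon > 0$ was arbitrary, we have shown that taking

$$\hat{f}_n(x) = \begin{cases} 1, & \hat{\eta}_n(x) \ge 1/2 \\ 0, & \text{otherwise} \end{cases} \tag{6.32}$$

satisfies

$$P(\hat{f}_n(X) \neq Y) - P(f^*(X) \neq Y) \le 2 E[|\hat{\eta}_n(X) - \eta(X)|] \to 0 \tag{6.33}$$

if

$$M \to \infty \quad \text{and} \quad \frac{n}{M} \to \infty \quad \text{as } n \to \infty. \tag{6.34}$$

Note: $P(\hat{f}_n(X) \neq Y) = E[R(\hat{f}_n)]$ is the expected risk of $\hat{f}_n$, with expectation over the distributions of $(X, Y)$ and $\{X_i, Y_i\}_{i=1}^{n}$.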
Chapter 7

Probably Approximately Correct (PAC) 
Learning 1 

7.1 Introduction 

7.1.1 Overview of the Learning Problem 

The fundamental problem in learning from data is proper Model Selection. As we have seen in the previous lectures, a model that is too complex could overfit the training data (causing an estimation error) and a model that is too simple could be a bad approximation of the function that we are trying to estimate (causing an approximation error). The estimation error arises because we do not know the true joint distribution of data in the input and output space, and therefore we minimize the empirical risk (which, for each candidate model, is a random number depending on the data) and estimate the average risk again from the limited number of training samples we have. The approximation error measures how well the functions in the chosen model space can approximate the underlying relationship between the input space and the output space, and in general improves as the "size" of our model space increases.

7.1.2 Lecture Outline 

In the preceding lectures, we looked at some solutions to the overfitting problem. The basic
approach followed was the Method of Sieves, in which the complexity of the model space was chosen as a
function of the number of training samples. In particular, both the denoising and classification problems
we looked at considered estimators based on histogram partitions, and the size of the partition was an
increasing function of the number of training samples. In this lecture, we will refine our learning methods
further and introduce model selection procedures that automatically adapt to the distribution of the
training data, rather than basing the model class solely on the number of samples. This sort of adaptivity
will play a major role in the design of more effective classifiers and denoising methods. The key to
designing data-adaptive model selection procedures is obtaining useful upper bounds on the estimation
error. To this end, we will introduce the idea of "Probably Approximately Correct" learning methods.

7.2 Recap: Method of Sieves 

The Method of Sieves underpinned our approaches to the denoising problem and the histogram classification
problem. Recall that the basic idea is to define a sequence of model spaces $\mathcal{F}_1, \mathcal{F}_2, \ldots$ of increasing
complexity, and then, given the training data $\{X_i, Y_i\}_{i=1}^n$, select a model according to

$$
\hat{f}_n = \arg\min_{f \in \mathcal{F}_n} \hat{R}_n(f). \tag{7.1}
$$

1 This content is available online at <http://cnx.org/content/m16282/1.2/>.

The choice of the model space $\mathcal{F}_n$ (and hence the model complexity and structure) is determined completely
by the sample size $n$, and does not depend on the (empirical) distribution of the training data. This is a major
limitation of the sieve method. In a nutshell, the method of sieves tells us to average the data in a certain
way (e.g., over a partition of $\mathcal{X}$) based on the sample size, independent of the sample values themselves.
In general, learning comprises two things:

1. Averaging data to reduce variability

2. Deciding where (or how) to average

Sieves force us to deal with (2) a priori (before we analyze the training data). This will lead
to suboptimal classifiers and estimators, in general. Indeed, deciding where/how to average is the really
interesting and fundamental aspect of learning; once this is decided we have effectively solved the learning
problem. There are at least two possibilities for breaking the rigidity of the method of sieves, as we shall see
in the following section.

7.3 Data Adaptive Model Spaces 

7.3.1 Structural Risk Minimization (SRM) 

The basic idea is to select $\mathcal{F}_n$ based on the training data themselves. Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be a sequence of model
spaces of increasing sizes/complexities with

$$
\lim_{k \to \infty} \inf_{f \in \mathcal{F}_k} R(f) = R^*. \tag{7.2}
$$

Let

$$
\hat{f}_{n,k} = \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f) \tag{7.3}
$$

be a function from $\mathcal{F}_k$ that minimizes the empirical risk. This gives us a sequence of selected models
$\hat{f}_{n,1}, \hat{f}_{n,2}, \ldots$ Also associate with each set $\mathcal{F}_k$ a value $C_{n,k} > 0$ that measures the complexity or "size" of the
set $\mathcal{F}_k$. Typically, $C_{n,k}$ is monotonically increasing with $k$ (since the sets are of increasing complexity) and
decreasing with $n$ (since we become more confident with more training data). More precisely, suppose that
the $C_{n,k}$ are chosen so that

$$
P\left( \sup_{f \in \mathcal{F}_k} |\hat{R}_n(f) - R(f)| > C_{n,k} \right) < \delta \tag{7.4}
$$

for some small $\delta > 0$. Then we may conclude that with very high probability (at least $1 - \delta$) the empirical
risk $\hat{R}_n$ is within $C_{n,k}$ of $R$ uniformly on the class $\mathcal{F}_k$. This type of bound suffices to bound the estimation
error (variance) of the model selection process in the form $R(f) \le \hat{R}_n(f) + C_{n,k}$, and SRM selects the final
model by minimizing this bound over all functions in $\bigcup_{k \ge 1} \mathcal{F}_k$. The selected model is given by $\hat{f}_{n,\hat{k}}$, where

$$
\hat{k} = \arg\min_{k \ge 1} \left\{ \hat{R}_n(\hat{f}_{n,k}) + C_{n,k} \right\}. \tag{7.5}
$$

A typical example could be the use of VC dimension to characterize the complexity of the collection of
model spaces, i.e., $C_{n,k}$ is derived from a bound on the estimation error.
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7.3.2 Complexity Regularization 

Consider a very large class of candidate models $\mathcal{F}$. To each $f \in \mathcal{F}$ assign a complexity value $C_n(f)$. Assume
that the complexity values are chosen so that

$$
P\left( \sup_{f \in \mathcal{F}} \left\{ |\hat{R}_n(f) - R(f)| - C_n(f) \right\} > 0 \right) < \delta. \tag{7.6}
$$

This probability bound also implies an upper bound on the estimation error, and complexity regularization
is based on the criterion

$$
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + C_n(f) \right\}. \tag{7.7}
$$

Complexity regularization and SRM are very similar, and equivalent in certain instances. A distinguishing
feature of SRM and complexity regularization techniques is that the complexity and structure of the model
is not fixed prior to examining the data; the data aid in the selection of the best complexity. In fact, the key
difference compared to the Method of Sieves is that these techniques allow the data to play an integral
role in deciding where and how to average the data.

7.4 Probably Approximately Correct (PAC) learning 

Probability bounds of the forms in (7.4) and (7.6) are the foundation for SRM and complexity regularization 
techniques. The simplest of these bounds are known as PAC bounds in the machine learning community. 

7.4.1 Approximation and Estimation Errors 

In order to develop complexity regularization schemes we will need to revisit the estimation error /
approximation error trade-off. Let $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$ for some space of models $\mathcal{F}$. Then

$$
R(\hat{f}_n) - R^* = \underbrace{R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R^*}_{\text{approximation error}}. \tag{7.8}
$$

The approximation error depends on how close $f^*$ is to $\mathcal{F}$, and without making assumptions, this
is unknown. The estimation error is quantifiable, and depends on the complexity or size of $\mathcal{F}$. The error
decomposition is illustrated in Figure 7.1. The estimation error quantifies how much we can "trust" the
empirical risk minimization process to select a model close to the best in a given class.

Figure 7.1: Relationship between the errors

Probability bounds of the forms in (7.4) and (7.6) guarantee that the empirical risk is uniformly close to
the true risk, and using (7.4) and (7.6) it is possible to show that with high probability the selected model
$\hat{f}_n$ satisfies

$$
R(\hat{f}_n) - \inf_{f \in \mathcal{F}_k} R(f) \le C(n, k) \tag{7.9}
$$

or

$$
R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \le C_n(f). \tag{7.10}
$$

7.4.2 The PAC Learning Model 

The estimation error will be small if $R(\hat{f}_n)$ is close to $\inf_{f \in \mathcal{F}} R(f)$. PAC learning expresses this as follows.
We want $\hat{f}_n$ to be a "probably approximately correct" (PAC) model from $\mathcal{F}$. Formally, we say that $\hat{f}_n$ is
$\epsilon$-accurate with confidence $1 - \delta$, or $(\epsilon, \delta)$-PAC for short, if

$$
P\left( R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) > \epsilon \right) < \delta. \tag{7.11}
$$

This says that the difference between $R(\hat{f}_n)$ and $\inf_{f \in \mathcal{F}} R(f)$ is greater than $\epsilon$ with probability less than
$\delta$. Sometimes, especially in the machine learning community, PAC bounds are stated as, "with probability
of at least $1 - \delta$, $|R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)| \le \epsilon$."

To introduce PAC bounds, let us consider a simple case. Let $\mathcal{F}$ consist of a finite number of models, and
let $|\mathcal{F}|$ denote that number. Furthermore, assume that $\min_{f \in \mathcal{F}} R(f) = 0$.
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Example 7.1 

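$\mathcal{F}$ = set of all histogram classifiers with $M$ bins $\Rightarrow |\mathcal{F}| = 2^M$.

$$
\min_{f \in \mathcal{F}} R(f) = 0 \;\Rightarrow\; \exists \text{ a classifier in } \mathcal{F} \text{ that has a zero probability of error} \tag{7.12}
$$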

Theorem 7.1: 

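Assume $|\mathcal{F}| < \infty$ and $\min_{f \in \mathcal{F}} R(f) = 0$, where $R(f) = P(f(X) \ne Y)$. Let
$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$, where $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{f(X_i) \ne Y_i\}}$. Then for every $n$ and $\epsilon > 0$,

$$
P\left( R(\hat{f}_n) > \epsilon \right) \le |\mathcal{F}| e^{-n\epsilon} \equiv \delta. \tag{7.13}
$$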

Proof: 

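Since $\min_{f \in \mathcal{F}} R(f) = 0$, it follows that $\hat{R}_n(\hat{f}_n) = 0$. In fact, there may be several $f \in \mathcal{F}$ such
that $\hat{R}_n(f) = 0$. Let $\mathcal{G} = \{f : \hat{R}_n(f) = 0\}$. Then

$$
\begin{aligned}
P\left( R(\hat{f}_n) > \epsilon \right) &\le P\left( \bigcup_{f \in \mathcal{G}} \{R(f) > \epsilon\} \right) \\
&= P\left( \bigcup_{f \in \mathcal{F} : R(f) > \epsilon} \{\hat{R}_n(f) = 0\} \right) \\
&\le \sum_{f \in \mathcal{F} : R(f) > \epsilon} P\left( \hat{R}_n(f) = 0 \right) \\
&\le |\mathcal{F}| (1 - \epsilon)^n.
\end{aligned} \tag{7.14}
$$

The last inequality follows from the fact that if $R(f) = P(f(X) \ne Y) > \epsilon$, then the probability
that $n$ i.i.d. samples will all satisfy $f(X_i) = Y_i$ is less than or equal to $(1 - \epsilon)^n$. Note that this is simply
the probability that $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{f(X_i) \ne Y_i\}} = 0$. Finally, apply the inequality $1 - x \le e^{-x}$ to
obtain the desired result.

Note that for $n$ sufficiently large, $\delta = |\mathcal{F}| e^{-n\epsilon}$ is arbitrarily small. To achieve an $(\epsilon, \delta)$-PAC bound
for a desired $\epsilon > 0$ and $\delta > 0$ we require at least $n = \frac{\log |\mathcal{F}| - \log \delta}{\epsilon}$ training examples.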
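
The sample-size requirement above can be checked numerically. The following is a minimal sketch and not part of the original notes: the function names `pac_sample_size` and `pac_failure_bound` are hypothetical, and the values of $|\mathcal{F}|$, $\epsilon$, and $\delta$ are arbitrary illustrations.

```python
import math


def pac_failure_bound(num_models: int, n: int, epsilon: float) -> float:
    """Right-hand side of (7.13): |F| * exp(-n * epsilon)."""
    return num_models * math.exp(-n * epsilon)


def pac_sample_size(num_models: int, epsilon: float, delta: float) -> int:
    """Number of samples n = ceil((log|F| + log(1/delta)) / epsilon)."""
    if num_models < 1 or epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need num_models >= 1, epsilon > 0, 0 < delta < 1")
    n = math.ceil((math.log(num_models) + math.log(1.0 / delta)) / epsilon)
    return max(n, 1)


# Illustrative values: histogram classifiers with M = 10 bins, so |F| = 2**10.
n_required = pac_sample_size(2 ** 10, epsilon=0.1, delta=0.05)
print(n_required, pac_failure_bound(2 ** 10, n_required, 0.1))
```

The required sample size grows only logarithmically in $|\mathcal{F}|$ and in $1/\delta$, but linearly in $1/\epsilon$.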

Corollary 7.1: 

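Assume that $|\mathcal{F}| < \infty$ and $\min_{f \in \mathcal{F}} R(f) = 0$. Then for every $n$

$$
E\left[ R(\hat{f}_n) \right] \le \frac{1 + \log |\mathcal{F}|}{n}. \tag{7.15}
$$

Proof:

Recall that for any non-negative random variable $Z$ with finite mean, $E[Z] = \int_0^\infty P(Z > t)\,dt$.
This follows from an application of integration by parts.

$$
\begin{aligned}
E\left[ R(\hat{f}_n) \right] &= \int_0^\infty P\left( R(\hat{f}_n) > t \right) dt \\
&= \int_0^u P\left( R(\hat{f}_n) > t \right) dt + \int_u^\infty P\left( R(\hat{f}_n) > t \right) dt, \quad \text{for any } u > 0 \\
&\le u + |\mathcal{F}| \int_u^\infty e^{-nt}\,dt \\
&= u + \frac{|\mathcal{F}|}{n} e^{-nu}.
\end{aligned} \tag{7.16}
$$

Minimizing with respect to $u$ produces the smallest upper bound with $u = \frac{\log |\mathcal{F}|}{n}$, for which the
right-hand side equals $\frac{\log |\mathcal{F}|}{n} + \frac{1}{n}$.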

Chapter 8 

Chernoff's Bound and Hoeffding's
Inequality 1

8.1 Introduction 

8.1.1 Motivation 

In the last lecture (Chapter 7) we considered a learning problem in which the optimal function belonged to a
finite class of functions. Specifically, for some collection of functions $\mathcal{F}$ with finite cardinality $|\mathcal{F}| < \infty$, we
had

$$
\min_{f \in \mathcal{F}} R(f) = 0 \;\Rightarrow\; f^* \in \mathcal{F}. \tag{8.1}
$$

This is almost never the situation in real-world learning problems. Let us suppose we have a finite
collection of candidate functions $\mathcal{F}$. Furthermore, we do not assume that the optimal function $f^*$, which
satisfies

$$
R(f^*) = \inf_{f} R(f), \tag{8.2}
$$

where the inf is taken over all measurable functions, is a member of $\mathcal{F}$. That is, we make few, if any,
assumptions about $f^*$. This situation is sometimes termed Agnostic Learning. The root of the word
agnostic literally means not known. The term agnostic learning is used to emphasize the fact that often,
perhaps usually, we may have no prior knowledge about $f^*$. The question then arises of how we can
reasonably select an $f \in \mathcal{F}$ in this setting.

8.1.2 The Problem 

The PAC-style bounds discussed in the previous lecture (Chapter 7) offer some help. Since we are selecting
a function based on the empirical risk, the question is how close $\hat{R}_n(f)$ is to $R(f)$, $\forall f \in \mathcal{F}$. In other words,
we wish that the empirical risk is a good indicator of the true risk for every function in $\mathcal{F}$. If this is the case,
the selection of the $f$ that minimizes the empirical risk,

$$
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f), \tag{8.3}
$$

should also yield a small true risk, that is, $R(\hat{f}_n)$ should be close to $\min_{f \in \mathcal{F}} R(f)$. Finally, we can thus
state our desired situation as

$$
P\left( \max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| > \epsilon \right) < \delta, \tag{8.4}
$$

for small values of $\epsilon$ and $\delta$. In other words, with probability at least $1 - \delta$, $|\hat{R}_n(f) - R(f)| \le \epsilon$,
$\forall f \in \mathcal{F}$. In this lecture, we will start to develop bounds of this form. First we will focus on bounding
$P\left( |\hat{R}_n(f) - R(f)| > \epsilon \right)$ for one fixed $f \in \mathcal{F}$.

1 This content is available online at <http://cnx.org/content/m16264/1.2/>.

8.2 Developing Initial Bounds 

To begin, let us recall the definition of empirical risk. Let $\{X_i, Y_i\}_{i=1}^n$ be a collection of training data. Then
the empirical risk is defined as

$$
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i). \tag{8.5}
$$

Note that since the training data $\{X_i, Y_i\}_{i=1}^n$ are assumed to be i.i.d. pairs, the terms in the sum are i.i.d.
random variables.
Let

$$
L_i = \ell(f(X_i), Y_i). \tag{8.6}
$$

The collection of losses $\{L_i\}_{i=1}^n$ is i.i.d. according to some unknown distribution (depending on the
unknown joint distribution of $(X, Y)$ and the loss function). The expectation of $L_i$ is $E[\ell(f(X_i), Y_i)] =
E[\ell(f(X), Y)] = R(f)$, the true risk of $f$. For now, let's assume that $f$ is fixed. Then

$$
E\left[ \hat{R}_n(f) \right] = \frac{1}{n} \sum_{i=1}^n E[\ell(f(X_i), Y_i)] = \frac{1}{n} \sum_{i=1}^n E[L_i] = R(f).
$$

We know from the strong law of large numbers that the average (or empirical mean) $\hat{R}_n(f)$ converges
almost surely to the true mean $R(f)$. That is, $\hat{R}_n(f) \to R(f)$ almost surely as $n \to \infty$. The question is
how fast.

8.3 Concentration of Measure Inequalities 

Concentration inequalities are upper bounds on how fast empirical means converge to their ensemble
counterparts, in probability. The area of the shaded tail regions in Figure 8.1 is $P\left( |\hat{R}_n(f) - R(f)| > \epsilon \right)$. We
are interested in finding out how fast this probability tends to zero as $n \to \infty$.

Figure 8.1: Distribution of $\hat{R}_n(f)$



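At this stage, we recall Markov's Inequality. Let $Z$ be a nonnegative random variable with density $p(z)$.
Then for any $t > 0$,

$$
\begin{aligned}
E[Z] &= \int_0^\infty z\,p(z)\,dz \\
&= \int_0^t z\,p(z)\,dz + \int_t^\infty z\,p(z)\,dz \\
&\ge 0 + t \int_t^\infty p(z)\,dz \\
&= t\,P(Z \ge t) \\
\Rightarrow\; P(Z \ge t) &\le \frac{E[Z]}{t} \\
\Rightarrow\; P(Z^2 \ge t^2) &\le \frac{E[Z^2]}{t^2}.
\end{aligned} \tag{8.9}
$$

Take $Z = |\hat{R}_n(f) - R(f)|$ and $t = \epsilon$. Then

$$
P\left( |\hat{R}_n(f) - R(f)| \ge \epsilon \right) \le \frac{E\left[ |\hat{R}_n(f) - R(f)|^2 \right]}{\epsilon^2} = \frac{\mathrm{var}\left( \hat{R}_n(f) \right)}{\epsilon^2} = \frac{\sum_{i=1}^n \mathrm{var}\left( \frac{L_i}{n} \right)}{\epsilon^2} = \frac{\mathrm{var}\left( \ell(f(X), Y) \right)}{n \epsilon^2}. \tag{8.10}
$$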



So, the probability goes to zero at a rate of at least $n^{-1}$. However, it turns out that this is an extremely
loose bound. According to the Central Limit Theorem,

$$
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n L_i \to N\left( R(f), \frac{\sigma_L^2}{n} \right) \quad \text{as } n \to \infty \tag{8.11}
$$

in distribution, where $\sigma_L^2$ denotes the variance of $L_i$. This suggests that for large values of $n$,

$$
P\left( |\hat{R}_n(f) - R(f)| > \epsilon \right) \approx O\left( e^{-\frac{n \epsilon^2}{2 \sigma_L^2}} \right). \tag{8.12}
$$

That is, the Gaussian tail probability is tending to zero exponentially fast.

8.4 Chernoff's Bound 

Note that for any random variable $Z$ (not necessarily nonnegative) and $t > 0$,

$$
P(Z > t) = P\left( e^{sZ} > e^{st} \right) \le \frac{E\left[ e^{sZ} \right]}{e^{st}}, \quad \forall s > 0, \text{ by Markov's inequality.} \tag{8.13}
$$

Chernoff's bound is based on finding the value of $s$ that minimizes the upper bound. Suppose $Z$ is a sum of
independent random variables. For example, say

$$
Z = \sum_{i=1}^n \left( \ell(f(X_i), Y_i) - R(f) \right) = n \left( \hat{R}_n(f) - R(f) \right); \tag{8.14}
$$

then the bound becomes

$$
\begin{aligned}
P\left( \sum_{i=1}^n (L_i - E[L_i]) > t \right) &\le e^{-st} E\left[ e^{s \sum_{i=1}^n (L_i - E[L_i])} \right] \\
&= e^{-st} \prod_{i=1}^n E\left[ e^{s (L_i - E[L_i])} \right], \quad \text{from independence.}
\end{aligned} \tag{8.15}
$$

Thus, the problem of finding a tight bound boils down to finding a good bound for $E\left[ e^{s (L_i - E[L_i])} \right]$.
Chernoff ('52) first studied this situation for binary random variables. Then, Hoeffding ('63) derived a more
general result for arbitrary bounded random variables.

8.5 Hoeffding's Inequality

Theorem 8.1: Hoeffding's Inequality

Let $Z_1, Z_2, \ldots, Z_n$ be independent bounded random variables such that $Z_i \in [a_i, b_i]$ with probability
1. Let $S_n = \sum_{i=1}^n Z_i$. Then for any $t > 0$, we have

$$
P\left( |S_n - E[S_n]| \ge t \right) \le 2 e^{-\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}}. \tag{8.16}
$$

Proof:

The key to proving Hoeffding's inequality is the following upper bound: if $Z$ is a random variable
with $E[Z] = 0$ and $a \le Z \le b$, then

$$
E\left[ e^{sZ} \right] \le e^{\frac{s^2 (b - a)^2}{8}}. \tag{8.17}
$$

This upper bound is derived as follows. By the convexity of the exponential function,

$$
e^{sz} \le \frac{z - a}{b - a} e^{sb} + \frac{b - z}{b - a} e^{sa}, \quad \text{for } a \le z \le b. \tag{8.18}
$$




Figure 8.2: Convexity of exponential function. 



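Thus,

$$
\begin{aligned}
E\left[ e^{sZ} \right] &\le E\left[ \frac{Z - a}{b - a} \right] e^{sb} + E\left[ \frac{b - Z}{b - a} \right] e^{sa} \\
&= \frac{b}{b - a} e^{sa} - \frac{a}{b - a} e^{sb}, \quad \text{since } E[Z] = 0 \\
&= \left( 1 - \theta + \theta e^{s(b - a)} \right) e^{-\theta s (b - a)}, \quad \text{where } \theta = \frac{-a}{b - a}.
\end{aligned} \tag{8.19}
$$

Now let

$$
u = s(b - a) \quad \text{and define} \quad \phi(u) \equiv -\theta u + \log\left( 1 - \theta + \theta e^u \right). \tag{8.20}
$$

Then we have

$$
E\left[ e^{sZ} \right] \le \left( 1 - \theta + \theta e^{s(b - a)} \right) e^{-\theta s (b - a)} = e^{\phi(u)}. \tag{8.21}
$$

To bound $\phi(u)$ from above, let us express it as a Taylor series with remainder:

$$
\phi(u) = \phi(0) + u \phi'(0) + \frac{u^2}{2} \phi''(v) \quad \text{for some } v \in [0, u]. \tag{8.22}
$$

Here $\phi(0) = 0$, and

$$
\begin{aligned}
\phi'(u) &= -\theta + \frac{\theta e^u}{1 - \theta + \theta e^u} \;\Rightarrow\; \phi'(0) = 0, \\
\phi''(u) &= \frac{\theta e^u}{1 - \theta + \theta e^u} - \frac{(\theta e^u)^2}{(1 - \theta + \theta e^u)^2} \\
&= \frac{\theta e^u}{1 - \theta + \theta e^u} \left( 1 - \frac{\theta e^u}{1 - \theta + \theta e^u} \right) \\
&= \rho (1 - \rho).
\end{aligned} \tag{8.23}
$$

Now, $\phi''(u)$ is maximized by

$$
\rho = \frac{\theta e^u}{1 - \theta + \theta e^u} = \frac{1}{2} \;\Rightarrow\; \phi''(u) \le \frac{1}{4}. \tag{8.24}
$$

So,

$$
\phi(u) \le \frac{u^2}{8} = \frac{s^2 (b - a)^2}{8} \tag{8.25}
$$

$$
\Rightarrow\; E\left[ e^{sZ} \right] \le e^{\frac{s^2 (b - a)^2}{8}}. \tag{8.26}
$$

Now, we can apply this upper bound to derive Hoeffding's inequality.

$$
\begin{aligned}
P\left( S_n - E[S_n] \ge t \right) &\le e^{-st} \prod_{i=1}^n E\left[ e^{s (Z_i - E[Z_i])} \right] \\
&\le e^{-st} \prod_{i=1}^n e^{\frac{s^2 (b_i - a_i)^2}{8}} \\
&= e^{-st} e^{s^2 \sum_{i=1}^n \frac{(b_i - a_i)^2}{8}} \\
&= e^{-\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}},
\end{aligned} \tag{8.27}
$$

by choosing $s = \frac{4t}{\sum_{i=1}^n (b_i - a_i)^2}$.

Similarly, $P\left( E[S_n] - S_n \ge t \right) \le e^{-\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}}$. Adding the two one-sided bounds gives (8.16), which
completes the proof of Hoeffding's theorem.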
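
As a sanity check on the theorem, one can compare the bound with a simulated tail frequency. The sketch below is an illustration added here, not part of the original notes: it assumes Bernoulli($p$) variables (so $b_i - a_i = 1$ and the bound on the sample mean is $2e^{-2n\epsilon^2}$), and the names `hoeffding_bound` and `empirical_tail` as well as the parameter values are hypothetical.

```python
import math
import random


def hoeffding_bound(n: int, eps: float) -> float:
    """Two-sided Hoeffding bound for the mean of n variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)


def empirical_tail(n: int, p: float, eps: float, trials: int, seed: int = 0) -> float:
    """Fraction of simulated experiments in which |sample mean - p| >= eps."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        mean = sum(1 for _ in range(n) if rng.random() < p) / n
        # Small tolerance so that deviations exactly equal to eps are counted
        # despite floating-point rounding.
        if abs(mean - p) >= eps - 1e-12:
            exceed += 1
    return exceed / trials


n, p, eps = 100, 0.3, 0.1
freq = empirical_tail(n, p, eps, trials=2000)
bound = hoeffding_bound(n, eps)
print(f"simulated tail frequency: {freq:.4f}, Hoeffding bound: {bound:.4f}")
```

The simulated frequency is typically far below the bound, which is expected: Hoeffding's inequality holds for every bounded distribution and so cannot be tight for any particular one.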

Example 

Application 

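Let $Z_i = \mathbf{1}_{\{f(X_i) \ne Y_i\}} - R(f)$, as in the classification problem. Then for a fixed $f$, it follows from
Hoeffding's inequality (i.e., Chernoff's bound in this special case) that

$$
\begin{aligned}
P\left( |\hat{R}_n(f) - R(f)| \ge \epsilon \right) &= P\left( \frac{1}{n} |S_n - E[S_n]| \ge \epsilon \right) \\
&= P\left( |S_n - E[S_n]| \ge n\epsilon \right) \\
&\le 2 e^{-\frac{2 (n\epsilon)^2}{n}} \\
&= 2 e^{-2 n \epsilon^2}.
\end{aligned} \tag{8.28}
$$

Now, we want a bound like this to hold uniformly for all $f \in \mathcal{F}$. Assume that $\mathcal{F}$ is a finite
collection of models and let $|\mathcal{F}|$ denote its cardinality. We would like to bound the probability that
$\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon$. Note that the event

$$
\left\{ \max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon \right\} = \bigcup_{f \in \mathcal{F}} \left\{ |\hat{R}_n(f) - R(f)| \ge \epsilon \right\}. \tag{8.29}
$$

Therefore

$$
\begin{aligned}
P\left( \max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon \right) &= P\left( \bigcup_{f \in \mathcal{F}} \left\{ |\hat{R}_n(f) - R(f)| \ge \epsilon \right\} \right) \\
&\le \sum_{f \in \mathcal{F}} P\left( |\hat{R}_n(f) - R(f)| \ge \epsilon \right), \quad \text{the "union of events" bound} \\
&\le 2 |\mathcal{F}| e^{-2 n \epsilon^2}, \quad \text{by Hoeffding's inequality.}
\end{aligned} \tag{8.30}
$$

Thus, we have shown that with probability at least $1 - 2 |\mathcal{F}| e^{-2 n \epsilon^2}$, $\forall f \in \mathcal{F}$

$$
|\hat{R}_n(f) - R(f)| < \epsilon. \tag{8.31}
$$

And accordingly, we can be reasonably confident in selecting $f$ from $\mathcal{F}$ based on the empirical risk
function $\hat{R}_n$.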
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Chapter 9 

Classification Error Bounds 

9.1 Recap: Classifier design 

Given a set of training data $\{X_i, Y_i\}_{i=1}^n$ and a finite collection of candidate functions $\mathcal{F}$, select $\hat{f}_n \in \mathcal{F}$ that
(hopefully) is a good predictor for future cases. That is,

$$
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) \tag{9.1}
$$

where $\hat{R}_n(f)$ is the empirical risk. For any particular $f \in \mathcal{F}$, the corresponding empirical risk is defined as

$$
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{f(X_i) \ne Y_i\}}. \tag{9.2}
$$

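9.2 Hoeffding's inequality

Hoeffding's inequality (Chernoff's bound in this case) allows us to gauge how close $\hat{R}_n(f)$ is to the true risk
of $f$, $R(f)$, in probability:

$$
P\left( |\hat{R}_n(f) - R(f)| \ge \epsilon \right) \le 2 e^{-2 n \epsilon^2}. \tag{9.3}
$$

Since our selection process involves deciding among all $f \in \mathcal{F}$, we would like to gauge how close the
empirical risks are to their expected values. We can do this by studying the probability that one or more of
the empirical risks deviates significantly from its expected value. This is captured by the probability

$$
P\left( \max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon \right). \tag{9.4}
$$

Note that the event

$$
\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon \tag{9.5}
$$

is equivalent to the union of the events

$$
\bigcup_{f \in \mathcal{F}} \left\{ |\hat{R}_n(f) - R(f)| \ge \epsilon \right\}. \tag{9.6}
$$

Therefore, we can use Bonferroni's bound (aka the "union of events" or "union" bound) to obtain

$$
\begin{aligned}
P\left( \max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon \right) &= P\left( \bigcup_{f \in \mathcal{F}} \left\{ |\hat{R}_n(f) - R(f)| \ge \epsilon \right\} \right) \\
&\le \sum_{f \in \mathcal{F}} P\left( |\hat{R}_n(f) - R(f)| \ge \epsilon \right) \\
&\le \sum_{f \in \mathcal{F}} 2 e^{-2 n \epsilon^2} \\
&= 2 |\mathcal{F}| e^{-2 n \epsilon^2},
\end{aligned} \tag{9.7}
$$

where $|\mathcal{F}|$ is the number of classifiers in $\mathcal{F}$. In the proof of Hoeffding's inequality we also obtained a one-sided
inequality that implied

$$
P\left( R(f) - \hat{R}_n(f) \ge \epsilon \right) \le e^{-2 n \epsilon^2} \tag{9.8}
$$

and hence

$$
P\left( \max_{f \in \mathcal{F}} \left( R(f) - \hat{R}_n(f) \right) \ge \epsilon \right) \le |\mathcal{F}| e^{-2 n \epsilon^2}. \tag{9.9}
$$

We can restate the inequality above as follows. For all $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$

$$
R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}. \tag{9.10}
$$

This follows by setting $\delta = |\mathcal{F}| e^{-2 n \epsilon^2}$ and solving for $\epsilon$. Thus with a high probability ($1 - \delta$), the true risk for
all $f \in \mathcal{F}$ is bounded by the empirical risk of $f$ plus a constant that depends on $\delta > 0$, the number of training
samples $n$, and the size of $\mathcal{F}$. Most importantly the bound does not depend on the unknown distribution $P_{XY}$.
Therefore, we can call this a distribution-free bound.

1 This content is available online at <http://cnx.org/content/m16265/1.2/>.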
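
To make the step "set $\delta = |\mathcal{F}| e^{-2n\epsilon^2}$ and solve for $\epsilon$" concrete, here is a small sketch. It is an added illustration under stated assumptions: the function names `uniform_deviation` and `failure_probability` are hypothetical, and the numbers ($|\mathcal{F}| = 1000$, $n = 5000$, $\delta = 0.05$) are arbitrary.

```python
import math


def failure_probability(num_models: int, n: int, eps: float) -> float:
    """One-sided union bound (9.9): |F| * exp(-2 * n * eps**2)."""
    return num_models * math.exp(-2.0 * n * eps ** 2)


def uniform_deviation(num_models: int, n: int, delta: float) -> float:
    """The eps solving |F| * exp(-2 * n * eps**2) = delta, as in (9.10)."""
    return math.sqrt((math.log(num_models) + math.log(1.0 / delta)) / (2.0 * n))


eps = uniform_deviation(num_models=1000, n=5000, delta=0.05)
print(f"eps = {eps:.4f}, check delta = {failure_probability(1000, 5000, eps):.4f}")
```

Note how weakly the deviation depends on the number of candidates: multiplying $|\mathcal{F}|$ by 10 adds only $\log 10$ inside the square root.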

9.3 Error Bounds 

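We can use the distribution-free bound above to obtain a bound on the expected performance of the
minimum empirical risk classifier

$$
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f). \tag{9.11}
$$

We are interested in bounding

$$
E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f), \tag{9.12}
$$

the expected risk of $\hat{f}_n$ minus the minimum risk for all $f \in \mathcal{F}$. Note that this difference is always non-negative
since $\hat{f}_n$ is at best as good as

$$
f^* = \arg\min_{f \in \mathcal{F}} R(f). \tag{9.13}
$$

Recall that $\forall f \in \mathcal{F}$ and $\forall \delta > 0$, with probability at least $1 - \delta$

$$
R(f) \le \hat{R}_n(f) + C(\mathcal{F}, n, \delta) \tag{9.14}
$$

where

$$
C(\mathcal{F}, n, \delta) = \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}. \tag{9.15}
$$

In particular, since this holds for all $f \in \mathcal{F}$ including $\hat{f}_n$,

$$
R(\hat{f}_n) \le \hat{R}_n(\hat{f}_n) + C(\mathcal{F}, n, \delta) \tag{9.16}
$$

and for any other $f \in \mathcal{F}$

$$
R(\hat{f}_n) \le \hat{R}_n(f) + C(\mathcal{F}, n, \delta) \tag{9.17}
$$

since $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f)$ $\forall f \in \mathcal{F}$. In particular,

$$
R(\hat{f}_n) \le \hat{R}_n(f^*) + C(\mathcal{F}, n, \delta) \tag{9.18}
$$

where $f^* = \arg\min_{f \in \mathcal{F}} R(f)$.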

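Let $\Omega$ denote the set of events on which the above inequality holds. Then by definition

$$
P(\Omega) \ge 1 - \delta. \tag{9.19}
$$

We can now bound $E\left[ R(\hat{f}_n) \right] - R(f^*)$ as follows:

$$
\begin{aligned}
E\left[ R(\hat{f}_n) \right] - R(f^*) &= E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) + \hat{R}_n(f^*) - R(f^*) \right] \\
&= E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \right]
\end{aligned} \tag{9.20}
$$

since $E\left[ \hat{R}_n(f^*) \right] = R(f^*)$. The quantity above is bounded as follows. Writing $\Omega^c$ for the complement of $\Omega$,

$$
\begin{aligned}
E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \right] &= E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \mid \Omega \right] P(\Omega) + E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \mid \Omega^c \right] P(\Omega^c) \\
&\le E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \mid \Omega \right] + \delta
\end{aligned} \tag{9.21}
$$

since $P(\Omega) \le 1$, $1 - P(\Omega) \le \delta$ and $R(\hat{f}_n) - \hat{R}_n(f^*) \le 1$. Moreover,

$$
E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \mid \Omega \right] \le E\left[ R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) \mid \Omega \right] \le C(\mathcal{F}, n, \delta). \tag{9.22}
$$

Thus

$$
E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \right] \le C(\mathcal{F}, n, \delta) + \delta. \tag{9.23}
$$

So we have

$$
E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}} + \delta, \quad \forall \delta > 0. \tag{9.24}
$$

In particular, for $\delta = \sqrt{1/n}$, we have

$$
\begin{aligned}
E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) &\le \sqrt{\frac{\log |\mathcal{F}| + \log n}{2n}} + \frac{1}{\sqrt{n}} \\
&\le \sqrt{\frac{\log |\mathcal{F}| + \log n + 2}{n}},
\end{aligned} \tag{9.25}
$$

since $\sqrt{x} + \sqrt{y} \le \sqrt{2} \sqrt{x + y}$, $\forall x, y > 0$.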

9.4 Application: Histogram Classifier 



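Let $\mathcal{F}$ be the collection of all classifiers with $M$ equal volume cells. Then $|\mathcal{F}| = 2^M$, and the histogram
classification rule

$$
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{f(X_i) \ne Y_i\}} \right\} \tag{9.26}
$$

satisfies

$$
E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{M \log 2 + 2 + \log n}{n}}, \tag{9.27}
$$

which suggests the choice $M = \log_2 n$ (balancing $M \log 2$ with $\log n$), resulting in

$$
E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) = O\left( \sqrt{\frac{\log n}{n}} \right). \tag{9.28}
$$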
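
The histogram classification rule in (9.26) can be sketched in a few lines. This is an illustrative sketch, not the notes' own implementation, and it assumes one-dimensional inputs in $[0, 1]$; the helper names (`bin_index`, `fit_histogram_classifier`, `empirical_risk`) and the toy data are hypothetical. Because the empirical risk is a sum of per-cell error counts, minimizing over all $2^M$ labelings reduces to a majority vote in each cell; the brute-force search at the end verifies this on the toy data.

```python
import itertools
from typing import List, Sequence


def bin_index(x: float, num_bins: int) -> int:
    """Index of the equal-width cell of [0, 1] that contains x."""
    return min(int(x * num_bins), num_bins - 1)


def empirical_risk(labels: Sequence[int], xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of training points misclassified by the cell labeling."""
    errors = sum(1 for x, y in zip(xs, ys) if labels[bin_index(x, len(labels))] != y)
    return errors / len(xs)


def fit_histogram_classifier(xs: Sequence[float], ys: Sequence[int], num_bins: int) -> List[int]:
    """Empirical risk minimizer over all 2**num_bins cell labelings.

    Majority vote inside each cell; ties and empty cells get label 0.
    """
    ones = [0] * num_bins
    counts = [0] * num_bins
    for x, y in zip(xs, ys):
        j = bin_index(x, num_bins)
        counts[j] += 1
        ones[j] += y
    return [1 if 2 * ones[j] > counts[j] else 0 for j in range(num_bins)]


# Toy data, chosen only for illustration.
xs = [0.05, 0.10, 0.30, 0.35, 0.60, 0.70, 0.80, 0.95]
ys = [0, 0, 1, 1, 1, 0, 1, 1]
labels = fit_histogram_classifier(xs, ys, num_bins=4)
risk = empirical_risk(labels, xs, ys)

# Brute-force search over all 2**4 labelings (only feasible for tiny M).
brute_force = min(empirical_risk(c, xs, ys) for c in itertools.product([0, 1], repeat=4))
print(labels, risk, brute_force)
```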



Chapter 10 

Error Bounds in Countably Infinite 
Spaces 1 

10.1 Introduction 

In the last lecture (Chapter 9), we studied bounds of the following form: for any $\delta > 0$, with probability at
least $1 - \delta$,

$$
R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}, \quad \forall f \in \mathcal{F}, \tag{10.1}
$$

which led to upper bounds on the estimation error of the form

$$
E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log |\mathcal{F}| + \log(n) + 2}{n}}. \tag{10.2}
$$

The key assumptions made in deriving the error bounds were:

(i): bounded loss function

(ii): finite collection of candidate functions

The bounds are valid for every $P_{XY}$ and are called distribution-free.

10.2 Deriving Bounds for Countably Infinite Spaces 

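In this lecture we will generalize the previous results in a powerful way by developing bounds applicable to
possibly infinite collections of candidates. To start, let us suppose that $\mathcal{F}$ is a countable, possibly infinite,
collection of candidate functions. Assign a positive number $c(f)$ to each $f \in \mathcal{F}$, such that

$$
\sum_{f \in \mathcal{F}} e^{-c(f)} \le 1. \tag{10.3}
$$

The numbers $c(f)$ can be interpreted as

(i): measures of complexity
(ii): $-\log$ of prior probabilities
(iii): codelengths

1 This content is available online at <http://cnx.org/content/m16271/1.2/>.

In particular, if $p(f)$ is the prior probability of $f$ then

$$
e^{-(-\log p(f))} = p(f) \tag{10.4}
$$

so $c(f) \equiv -\log p(f)$ produces

$$
\sum_{f \in \mathcal{F}} e^{-c(f)} = \sum_{f \in \mathcal{F}} p(f) = 1. \tag{10.5}
$$

Now recall Hoeffding's inequality. For each $f$ and every $\epsilon > 0$

$$
P\left( R(f) - \hat{R}_n(f) > \epsilon \right) \le e^{-2 n \epsilon^2} \tag{10.6}
$$

or for every $\delta > 0$

$$
P\left( R(f) - \hat{R}_n(f) > \sqrt{\frac{\log(1/\delta)}{2n}} \right) \le \delta. \tag{10.7}
$$

Suppose $\delta > 0$ is specified. Using the values $c(f)$ for $f \in \mathcal{F}$, define

$$
\delta(f) = e^{-c(f)} \delta. \tag{10.8}
$$

Then we have

$$
P\left( R(f) - \hat{R}_n(f) > \sqrt{\frac{\log(1/\delta(f))}{2n}} \right) \le \delta(f). \tag{10.9}
$$

Furthermore we can apply the union bound as follows:

$$
\begin{aligned}
P\left( \sup_{f \in \mathcal{F}} \left\{ R(f) - \hat{R}_n(f) - \sqrt{\frac{\log(1/\delta(f))}{2n}} \right\} > 0 \right) &= P\left( \bigcup_{f \in \mathcal{F}} \left\{ R(f) - \hat{R}_n(f) > \sqrt{\frac{\log(1/\delta(f))}{2n}} \right\} \right) \\
&\le \sum_{f \in \mathcal{F}} P\left( R(f) - \hat{R}_n(f) > \sqrt{\frac{\log(1/\delta(f))}{2n}} \right) \\
&\le \sum_{f \in \mathcal{F}} \delta(f) = \sum_{f \in \mathcal{F}} e^{-c(f)} \delta \le \delta.
\end{aligned} \tag{10.10}
$$

So for any $\delta > 0$, with probability at least $1 - \delta$, we have that $\forall f \in \mathcal{F}$

$$
\begin{aligned}
R(f) &\le \hat{R}_n(f) + \sqrt{\frac{\log(1/\delta(f))}{2n}} \\
&= \hat{R}_n(f) + \sqrt{\frac{c(f) + \log(1/\delta)}{2n}}.
\end{aligned} \tag{10.11}
$$

Special Case

Suppose $\mathcal{F}$ is finite and $c(f) = \log |\mathcal{F}|$ $\forall f \in \mathcal{F}$. Then

$$
\sum_{f \in \mathcal{F}} e^{-c(f)} = \sum_{f \in \mathcal{F}} e^{-\log |\mathcal{F}|} = \sum_{f \in \mathcal{F}} \frac{1}{|\mathcal{F}|} = 1 \tag{10.12}
$$

and

$$
\delta(f) = \frac{\delta}{|\mathcal{F}|}, \tag{10.13}
$$

which implies that for any $\delta > 0$ with probability at least $1 - \delta$, we have

$$
R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}, \quad \forall f \in \mathcal{F}. \tag{10.14}
$$

Note that this is precisely the bound we derived in the last lecture (Chapter 9).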

Choosing c(/) 

The generalized bounds allow us to handle countably infinite collections of candidate functions, but we
require that

$$
\sum_{f \in \mathcal{F}} e^{-c(f)} \le 1. \tag{10.15}
$$

Of course, if $c(f) = -\log p(f)$ where $p(f)$ is a proper prior probability distribution then we have

$$
\sum_{f \in \mathcal{F}} e^{-c(f)} = 1. \tag{10.16}
$$

However, it may be difficult to design a probability distribution over an infinite class of candidates. The
coding perspective provides a very practical means to this end.

Assume that we have assigned a uniquely decodable binary code to each $f \in \mathcal{F}$, and let $c(f)$ denote the
codelength for $f$. That is, the code for $f$ is $c(f)$ bits long. A very useful class of uniquely decodable codes
is the class of prefix codes.

Definition 10.1: Prefix Code 

A code is called a prefix code if no codeword is a prefix of any other codeword. 
Example: From Cover & Thomas '91 

Consider an alphabet of symbols, say A, B, C, and D, and the codebooks below.

Symbol   Singular   Nonsingular But Not   Uniquely Decodable But   Prefix Code
         Codebook   Uniquely Decodable    Not a Prefix Code
A        0          0                     10                       0
B        0          010                   00                       10
C        0          01                    11                       110
D        0          10                    110                      1110

Figure 10.1



In the singular codebook we assign the same codeword to each symbol - a system that is obviously 
flawed! In the second case, the codes are not singular but the codeword 010 could represent B or 
CA or AD. Hence it is not a uniquely decodable codebook. 

The third and fourth cases are both examples of uniquely decodable codebooks, but the fourth 
has the added feature that no codeword is a prefix of another. Prefix codes can be decoded from 
left to right since each codeword is "self-punctuating" - in this case with a zero to indicate the end 
of each word. 

To design a uniquely decodable codebook in general is as challenging as the problem of selecting
$c(f)$ to satisfy

$$
\sum_{f \in \mathcal{F}} e^{-c(f)} \le 1. \tag{10.17}
$$

However, prefix codes can often be easily designed or specified and they are inherently decodable.
Moreover, prefix codes satisfy an important inequality called the Kraft Inequality.



10.3 The Kraft Inequality 

For any binary prefix code, the codeword lengths $c_1, c_2, \ldots$ satisfy

$$
\sum_{i=1}^{\infty} 2^{-c_i} \le 1. \tag{10.18}
$$

Conversely, given any $c_1, c_2, \ldots$ satisfying the inequality above we can construct a prefix code with these
codeword lengths. We will prove this result a bit later, but now let's see how this is useful in our learning
problem.
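
The prefix property and the Kraft sum are easy to check mechanically. The sketch below is an added illustration: the function names `is_prefix_code` and `kraft_sum` are hypothetical, and the four codebooks are the ones from the Cover & Thomas example as described in the accompanying discussion (the singular codebook assigns the codeword 0 to every symbol). Exact rational arithmetic is used so the sums are not affected by rounding.

```python
from fractions import Fraction
from typing import Sequence


def is_prefix_code(codewords: Sequence[str]) -> bool:
    """True if no codeword is a prefix of (or equal to) another codeword."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True


def kraft_sum(codewords: Sequence[str]) -> Fraction:
    """Sum of 2**(-length) over all codewords, computed exactly."""
    return sum((Fraction(1, 2 ** len(w)) for w in codewords), Fraction(0))


codebooks = {
    "singular": ["0", "0", "0", "0"],
    "nonsingular": ["0", "010", "01", "10"],
    "uniquely_decodable": ["10", "00", "11", "110"],
    "prefix": ["0", "10", "110", "1110"],
}
for name, words in codebooks.items():
    print(name, is_prefix_code(words), kraft_sum(words))
```

Note that satisfying the Kraft inequality does not by itself make a particular codebook a prefix code: the third codebook has Kraft sum 7/8 but contains 11 as a prefix of 110. The converse statement only says that some prefix code with those lengths exists.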

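Assume that we have assigned a binary prefix codeword to each $f \in \mathcal{F}$, and let $c(f)$ denote the bit-length
of the codeword for $f$. Set $\delta(f) = 2^{-c(f)} \delta$. Then

$$
P\left( \sup_{f \in \mathcal{F}} \left\{ R(f) - \hat{R}_n(f) - \sqrt{\frac{\log(1/\delta(f))}{2n}} \right\} > 0 \right) \le \sum_{f \in \mathcal{F}} \delta(f) = \sum_{f \in \mathcal{F}} 2^{-c(f)} \delta \le \delta. \tag{10.19}
$$

This implies that for any $\delta > 0$ with probability at least $1 - \delta$ we have $\forall f \in \mathcal{F}$

$$
R(f) \le \hat{R}_n(f) + \sqrt{\frac{c(f) \log 2 + \log(1/\delta)}{2n}}. \tag{10.20}
$$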

Application 

Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be a sequence of finite sets of candidate functions with $|\mathcal{F}_1| < |\mathcal{F}_2| < \cdots$. We can design
prefix codes as follows. Use the codes 0, 10, 110, 1110, ... to encode the subscript $i$ in $\mathcal{F}_i$. For each class
$\mathcal{F}_i$, construct a set of binary codewords of length $\lceil \log_2 |\mathcal{F}_i| \rceil$ to uniquely encode each function in $\mathcal{F}_i$. Then,
encode any given function $f$ by first using the code for $i$ corresponding to the smallest $\mathcal{F}_i$ that $f$ belongs to,
followed by the length $\lceil \log_2 |\mathcal{F}_i| \rceil$ codeword for $f \in \mathcal{F}_i$. This is a prefix code.

Example 10.1: Histogram Classifiers 

$\mathcal{X} = [0,1]^d$, $\mathcal{Y} = \{0, 1\}$. Let $\mathcal{F}_k$, $k = 1, 2, \ldots$ denote the collection of histogram classification rules with
$k$ equal volume bins. We can use the following codebook for the index $k$.

k        Prefix Code
1        0
2        10
3        110
4        1110
...      ...

Figure 10.2

Follow this codeword with $k = \log_2 |\mathcal{F}_k|$ bits to indicate which of the $2^k$ possible histogram
rules is under consideration. Thus for any $f \in \mathcal{F}_k$ for some $k \ge 1$ there is a prefix code of length

$$
c(f) = k + k = 2k \text{ bits}. \tag{10.21}
$$

It follows that for any $\delta > 0$ with probability at least $1 - \delta$ we have $\forall f \in \bigcup_{k \ge 1} \mathcal{F}_k$

$$
R(f) \le \hat{R}_n(f) + \sqrt{\frac{2 k_f \log 2 + \log(1/\delta)}{2n}} \tag{10.22}
$$

where $k_f$ is the number of bins in the histogram corresponding to $f$. Contrast with the bound we had
for the class of $m$-bin histograms alone: with probability $\ge 1 - \delta$, $\forall f \in \mathcal{F}_m$

$$
R(f) \le \hat{R}_n(f) + \sqrt{\frac{m \log 2 + \log(1/\delta)}{2n}}. \tag{10.23}
$$

Notice the bound for all histogram rules is almost as good as the bound for only the $m$-bin rules.
That is, when $k_f = m$ the bounds are within a factor of $\sqrt{2}$. On the other hand, the new bound is
a big improvement, since it also gives us a guide for selecting the number of bins.
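
The comparison between (10.22) and (10.23) can be explored numerically. The following sketch is an added illustration, not part of the original notes; the function names and the values of $n$ and $\delta$ are hypothetical choices. It builds the index codeword, computes the codelength $c(f) = 2k$, and compares the two penalty terms for $k_f = m$.

```python
import math


def index_codeword(k: int) -> str:
    """Prefix codeword for the class index k: 0, 10, 110, 1110, ..."""
    return "1" * (k - 1) + "0"


def codelength_bits(k: int) -> int:
    """c(f) for f in F_k: k bits for the index plus k bits for the rule."""
    return len(index_codeword(k)) + k


def adaptive_penalty(k: int, n: int, delta: float) -> float:
    """Penalty term in (10.22), valid simultaneously for all k."""
    return math.sqrt((codelength_bits(k) * math.log(2) + math.log(1 / delta)) / (2 * n))


def fixed_class_penalty(m: int, n: int, delta: float) -> float:
    """Penalty term in (10.23), valid for the m-bin class alone."""
    return math.sqrt((m * math.log(2) + math.log(1 / delta)) / (2 * n))


n, delta = 1000, 0.05
ratios = [adaptive_penalty(k, n, delta) / fixed_class_penalty(k, n, delta) for k in range(1, 51)]
print(f"largest ratio for k <= 50: {max(ratios):.4f} (sqrt(2) = {math.sqrt(2):.4f})")

# Kraft check: there are 2**k rules in F_k, each with a codeword of 2k bits.
kraft_partial_sum = sum(2 ** k * 2.0 ** (-codelength_bits(k)) for k in range(1, 31))
print(kraft_partial_sum)
```

The ratio approaches but never reaches $\sqrt{2}$, since the $\log(1/\delta)$ term is shared by both penalties.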

Proof 10.1: Proof of the Kraft Inequality 

We will prove that for any binary prefix code, the codeword lengths $c_1, c_2, \ldots$ satisfy $\sum_{k \ge 1} 2^{-c_k} \le 1$.
The converse is easy to prove also, but it is not central to our purposes here (for a proof, see Cover
& Thomas '91). Consider a binary tree like the one shown below.

[Figure 10.3: binary tree with node labels Root, 000, 001]

The sequence of bit values leading from the root to a leaf of the tree represents a codeword.
The prefix condition implies that no codeword is a descendant of any other codeword in the tree.
Let $c_{\max}$ be the length of the longest codeword (also the number of branches to the deepest leaf)
in the tree.

Consider a leaf $i$ in the tree at level $c_i$. This leaf would have $2^{c_{\max} - c_i}$ descendants at level
$c_{\max}$. Furthermore, for each leaf the set of possible descendants at level $c_{\max}$ is disjoint (since no
codeword can be a prefix of another). Therefore, since the total number of possible leaves at level
$c_{\max}$ is $2^{c_{\max}}$, we have

$$
\sum_{i \in \text{leaves}} 2^{c_{\max} - c_i} \le 2^{c_{\max}} \;\Rightarrow\; \sum_{i \in \text{leaves}} 2^{-c_i} \le 1 \tag{10.24}
$$

which proves the case when the number of codewords is finite.

Suppose now that we have a countably infinite number of codewords. Let $b_1 b_2 \ldots b_{c_i}$ be the
$i$th codeword and let

$$
r_i = \sum_{j=1}^{c_i} b_j 2^{-j} \tag{10.25}
$$

be the real number corresponding to the binary expansion of the codeword. We can associate
the interval $[r_i, r_i + 2^{-c_i})$ with the $i$th codeword. This is the set of all real numbers whose binary
expansion begins with $b_1 b_2 \ldots b_{c_i}$. Since this is a subinterval of $[0, 1]$, and all such subintervals
corresponding to prefix codewords are disjoint, the sum of their lengths must be less than or equal
to 1. This proves the case where the number of codewords is infinite.
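
The interval construction used in the second half of the proof can be made concrete. The sketch below is an added illustration: the function name `codeword_interval` is hypothetical, and the prefix code 0, 10, 110, 1110 is the one used elsewhere in this chapter. It maps each codeword to $[r_i, r_i + 2^{-c_i})$ with exact fractions and checks that the intervals do not overlap.

```python
from fractions import Fraction
from typing import Tuple


def codeword_interval(word: str) -> Tuple[Fraction, Fraction]:
    """Interval [r, r + 2**-len(word)) of reals whose binary expansion starts with word."""
    r = sum((Fraction(int(bit), 2 ** (j + 1)) for j, bit in enumerate(word)), Fraction(0))
    return r, r + Fraction(1, 2 ** len(word))


intervals = sorted(codeword_interval(w) for w in ["0", "10", "110", "1110"])

# Half-open intervals sorted by left endpoint are disjoint exactly when each
# one ends no later than the next one begins.
disjoint = all(hi <= next_lo for (_, hi), (next_lo, _) in zip(intervals, intervals[1:]))
total_length = sum((hi - lo for lo, hi in intervals), Fraction(0))
print(intervals, disjoint, total_length)
```

The total length equals the Kraft sum of the code, which is why disjointness inside $[0, 1]$ forces the sum to be at most 1.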



Chapter 11 

Complexity Regularization 1 



11.1 Review: PAC Bounds 

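Consider a finite collection of models $\mathcal{F}$, and recall the basic PAC bound: for any $\delta > 0$, with probability at
least $1 - \delta$

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}, \quad \forall f \in \mathcal{F} \qquad (11.1)$$

where

$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(X_i), Y_i), \qquad R(f) = E[\ell(f(X), Y)] \qquad (11.2)$$

and the loss $\ell$ is assumed to be bounded between 0 and 1. Note that we can write the inequality above as:

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log(|\mathcal{F}|/\delta)}{2n}} \qquad (11.3)$$

Letting $\delta_f = \delta/|\mathcal{F}|$, we have:

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log(1/\delta_f)}{2n}} \qquad (11.4)$$

This is precisely the form of Hoeffding's inequality, with $\delta_f$ in place of the usual $\delta$. In effect, in order to
have Hoeffding's inequality hold with probability $1 - \delta$ for all $f \in \mathcal{F}$, we must distribute the "$\delta$-budget" or
"confidence-budget" over all $f \in \mathcal{F}$ (in this case, evenly distributed):

$$\delta_f = \frac{\delta}{|\mathcal{F}|}, \quad \forall f \in \mathcal{F} \qquad (11.5)$$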

However, to apply the union bound, we do not need to distribute $\delta$ evenly among the candidate models.
We only require:

$$\sum_{f \in \mathcal{F}} \delta_f = \delta \qquad (11.6)$$

So, if $p(f)$ are positive numbers satisfying $\sum_{f \in \mathcal{F}} p(f) = 1$, then we can take $\delta_f = p(f)\delta$. This provides
two advantages:

1. By choosing $p(f)$ larger for certain $f$, we can preferentially treat those candidates.

2. We do not need $\mathcal{F}$ to be finite; we only require $\sum_{f \in \mathcal{F}} p(f) = 1$.

Prefix codes are one way to achieve this. If we assign a binary prefix code of length $c(f)$ to each $f \in \mathcal{F}$,
then the values $p(f) = 2^{-c(f)}$ satisfy $\sum_{f \in \mathcal{F}} p(f) \le 1$ according to the Kraft inequality, so the total
confidence budget used is at most $\delta$.

1 This content is available online at <http://cnx.org/content/m16266/1.2/>.
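
To make the uneven confidence budget concrete, the sketch below computes $\delta_f = 2^{-c(f)}\delta$ and the resulting Hoeffding deviation term $\sqrt{\log(1/\delta_f)/(2n)}$ for a few codeword lengths. The function names and the specific lengths, $n$, and $\delta$ are assumptions chosen for illustration.

```python
import math

def confidence_budget(code_lengths, delta):
    """Per-model budget delta_f = 2^(-c(f)) * delta for codeword lengths c(f) in bits."""
    return [2.0 ** (-c) * delta for c in code_lengths]

def penalty(c_bits, n, delta):
    """Deviation term sqrt((c(f) log 2 + log(1/delta)) / (2n)) = sqrt(log(1/delta_f) / (2n))."""
    return math.sqrt((c_bits * math.log(2) + math.log(1.0 / delta)) / (2.0 * n))

lengths = [1, 2, 3, 3]              # hypothetical prefix-code lengths (Kraft sum equals 1)
budget = confidence_budget(lengths, delta=0.05)
print(sum(budget))                  # total budget does not exceed delta
print([round(penalty(c, n=1000, delta=0.05), 4) for c in lengths])
```

Models with short codewords receive a larger share of $\delta$ and therefore a smaller deviation term.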
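The main point of this lecture is to examine how PAC bounds of the form: with probability at least $1 - \delta$,

$$R(f) \le \hat{R}_n(f) + C(f, n, \delta), \quad \forall f \in \mathcal{F} \qquad (11.7)$$

can be used to select a model that comes close to achieving the best possible performance

$$\inf_{f \in \mathcal{F}} R(f) \qquad (11.8)$$

Let $\hat{f}_n$ be the model selected from $\mathcal{F}$ using the training data $\{X_i, Y_i\}_{i=1}^{n}$. We will specify this model in a
moment, but keep in mind that it is not necessarily the model with minimum empirical risk as before. We
would like to have

$$E[R(\hat{f}_n)] - \inf_{f \in \mathcal{F}} R(f) \qquad (11.9)$$

as small as possible. First, for any $\delta > 0$, define

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \{\hat{R}_n(f) + C(f, n, \delta)\} \qquad (11.10)$$

where

$$C(f, n, \delta) = \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}} \qquad (11.11)$$

Then with probability at least $1 - \delta$

$$R(f) \le \hat{R}_n(f) + C(f, n, \delta), \quad \forall f \in \mathcal{F} \qquad (11.12)$$

and in particular,

$$R(\hat{f}_n) \le \hat{R}_n(\hat{f}_n) + C(\hat{f}_n, n, \delta) \qquad (11.13)$$

so, by the definition of $\hat{f}_n$, $\forall f \in \mathcal{F}$

$$R(\hat{f}_n) \le \hat{R}_n(f) + C(f, n, \delta) \qquad (11.14)$$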



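We will make use of the inequality above in a moment. First note that, $\forall f \in \mathcal{F}$,

$$E[R(\hat{f}_n)] - R(f) = E[R(\hat{f}_n) - \hat{R}_n(f)] + E[\hat{R}_n(f) - R(f)] \qquad (11.15)$$

The second term is exactly 0, since $E[\hat{R}_n(f)] = R(f)$.

Now consider the first term, $E[R(\hat{f}_n) - \hat{R}_n(f)]$. Let $\Omega$ be the set of events on which

$$R(\hat{f}_n) \le \hat{R}_n(f) + C(f, n, \delta), \quad \forall f \in \mathcal{F} \qquad (11.16)$$

From the bound above, we know that $P(\Omega) \ge 1 - \delta$. Writing $\bar{\Omega}$ for the complement of $\Omega$,

$$E[R(\hat{f}_n) - \hat{R}_n(f)] = E[R(\hat{f}_n) - \hat{R}_n(f) \mid \Omega]\, P(\Omega) + E[R(\hat{f}_n) - \hat{R}_n(f) \mid \bar{\Omega}]\,(1 - P(\Omega)) \qquad (11.17)$$

$$\le C(f, n, \delta) + \delta \quad \text{(since } 0 \le R, \hat{R}_n \le 1,\ P(\Omega) \le 1 \text{ and } 1 - P(\Omega) \le \delta\text{)}$$

$$= \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}} + \delta$$

$$= \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}} \quad \text{(setting } \delta = 1/\sqrt{n}\text{)}$$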

We can summarize our analysis with the following theorem. 

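Theorem 11.1: Complexity Regularized Model Selection

Let $\mathcal{F}$ be a countable collection of models, and assign a positive number $c(f)$ to each $f \in \mathcal{F}$ such
that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. Define the minimum complexity regularized risk model

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{\hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}}\right\} \qquad (11.18)$$

Then,

$$E[R(\hat{f}_n)] \le \inf_{f \in \mathcal{F}} \left\{R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}\right\} \qquad (11.19)$$

This shows that

$$\hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \qquad (11.20)$$

is a reasonable surrogate for

$$R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \qquad (11.21)$$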
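
The selection rule of Theorem 11.1 is easy to state in code when the candidates are summarized by their empirical risks and codeword lengths. This is a sketch under that assumption; `regularized_choice` and the numbers below are hypothetical.

```python
import math

def regularized_choice(empirical_risks, code_lengths, n):
    """Index minimizing R_n(f) + sqrt((c(f) log 2 + 0.5 log n) / (2n))."""
    def objective(i):
        pen = math.sqrt((code_lengths[i] * math.log(2) + 0.5 * math.log(n)) / (2.0 * n))
        return empirical_risks[i] + pen
    return min(range(len(empirical_risks)), key=objective)

# Hypothetical candidates: lower empirical risk comes with a longer codeword.
risks = [0.30, 0.22, 0.20]
lengths = [2, 10, 200]
print(regularized_choice(risks, lengths, n=500))      # small n favors a shorter code
print(regularized_choice(risks, lengths, n=500000))   # large n tolerates the long code
```

As $n$ grows the penalty shrinks, so more complex candidates become eligible without any change to the rule itself.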



Example: Histogram Classifiers

Let $\mathcal{X} = [0, 1]^d$ be the input space and $\mathcal{Y} = \{0, 1\}$ be the output space. Let $\mathcal{F}_k$, $k = 1, 2, \ldots$, denote
the collection of histogram classification rules with $k$ equal volume bins. One choice of prefix code
for this example is: $k = 1 \Rightarrow$ code $= 0$, $k = 2 \Rightarrow$ code $= 10$, $k = 3 \Rightarrow$ code $= 110$, and so on.
Then, if the first $k$ bits of the codeword indicate that $f \in \mathcal{F}_k$, and they are followed by $k = \log_2|\mathcal{F}_k|$ bits
to indicate which of the $2^k$ histogram rules in $\mathcal{F}_k$ is under consideration, we have

$$f \in \mathcal{F}_k \Rightarrow c(f) = 2k \text{ bits} \qquad (11.22)$$

Let $\hat{f}_n$ be the model that solves the minimization

$$\min_{k \ge 1} \min_{f \in \mathcal{F}_k} \left\{\hat{R}_n(f) + \sqrt{\frac{2k\log 2 + \frac{1}{2}\log n}{2n}}\right\} \qquad (11.23)$$

That is, for each $k$, let

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f) \qquad (11.24)$$

Then select the best $k$ according to

$$\hat{k} = \arg\min_{k \ge 1} \left\{\hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{2k\log 2 + \frac{1}{2}\log n}{2n}}\right\} \qquad (11.25)$$

and set

$$\hat{f}_n = \hat{f}_n^{(\hat{k})} \qquad (11.26)$$

Then,

$$E[R(\hat{f}_n)] \le \inf_{k \ge 1} \left\{\min_{f \in \mathcal{F}_k} R(f) + \sqrt{\frac{2k\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}\right\} \qquad (11.27)$$

It is a simple exercise to show that if $d = 2$ and the Bayes decision boundary is a 1-d curve, then
by setting $k = \sqrt{n}$ and selecting the best $f$ from $\mathcal{F}_k$ we have

$$E[R(\hat{f}_n^{(k)})] - R^* = O(n^{-1/4}) \qquad (11.28)$$

where $R^*$ denotes the Bayes risk.

note: The complexity regularized classifier $\hat{f}_n$ adaptively achieves this rate, without user intervention.
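
The two-stage procedure (fit the best histogram for each $k$, then pick $k$ by penalized empirical risk with $c(f) = 2k$) can be sketched as follows. For simplicity the sketch assumes a one-dimensional input space $[0,1]$, majority-vote labels in each bin, and noise-free toy data; the function names are hypothetical.

```python
import math

def histogram_empirical_risk(xs, ys, k):
    """Empirical 0/1 risk of the majority-vote histogram rule with k bins on [0, 1]."""
    ones = [0] * k
    totals = [0] * k
    for x, y in zip(xs, ys):
        j = min(int(x * k), k - 1)      # bin index; x = 1.0 goes to the last bin
        totals[j] += 1
        ones[j] += y
    # In each bin, the majority label errs on the minority points.
    errors = sum(min(o, t - o) for o, t in zip(ones, totals))
    return errors / len(xs)

def select_num_bins(xs, ys, max_k):
    """Pick k minimizing empirical risk plus the penalty with c(f) = 2k bits."""
    n = len(xs)
    def objective(k):
        pen = math.sqrt((2 * k * math.log(2) + 0.5 * math.log(n)) / (2.0 * n))
        return histogram_empirical_risk(xs, ys, k) + pen
    return min(range(1, max_k + 1), key=objective)

# Noise-free toy data: label 1 exactly when x >= 0.5.
xs = [(i + 0.5) / 200 for i in range(200)]
ys = [1 if x >= 0.5 else 0 for x in xs]
print(select_num_bins(xs, ys, max_k=20))
```

With these data two bins already separate the classes, so any larger $k$ only adds penalty.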



Chapter 12 

Decision Trees 1 

12.1 Minimum Complexity Penalized Function 

Recall the basic results of the last lectures: let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output spaces respectively.
Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be random variables with unknown joint probability distribution $P_{XY}$. We would like
to use $X$ to "predict" $Y$. Consider a loss function $0 \le \ell(y_1, y_2) \le 1$, $\forall y_1, y_2 \in \mathcal{Y}$. This function is used to
measure the accuracy of our prediction. Let $\mathcal{F}$ be a collection of candidate functions (models), $f : \mathcal{X} \to \mathcal{Y}$.
The expected risk we incur is given by $R(f) = E_{XY}[\ell(f(X), Y)]$. We have access only to a number of i.i.d.
samples, $\{X_i, Y_i\}_{i=1}^{n}$. These allow us to compute the empirical risk $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(X_i), Y_i)$.

Assume in the following that $\mathcal{F}$ is countable. Assign a positive number $c(f)$ to each $f \in \mathcal{F}$ such that
$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. If we use a prefix code to describe each element of $\mathcal{F}$ and define $c(f)$ to be the codeword
length (in bits) for each $f \in \mathcal{F}$, the last inequality is automatically satisfied.
We define the minimum complexity penalized estimator as

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{\hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}}\right\}. \qquad (12.1)$$

As we showed previously we have the bound

$$E[R(\hat{f}_n)] \le \min_{f \in \mathcal{F}} \left\{R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}\right\}. \qquad (12.2)$$

The performance (risk) of $\hat{f}_n$ is on average no worse than

$$R(f_n^*) + \sqrt{\frac{c(f_n^*)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}} \qquad (12.3)$$

where

$$f_n^* = \arg\min_{f \in \mathcal{F}} \left\{R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}}\right\}. \qquad (12.4)$$

If it happens that the optimal function, that is

$$f^* = \arg\min_{f \text{ measurable}} R(f), \qquad (12.5)$$

is close to an $f \in \mathcal{F}$ with a small $c(f)$, then $\hat{f}_n$ will perform almost as well as the optimal function.

1 This content is available online at <http://cnx.org/content/m16287/1.2/>.

Example 12.1 

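Suppose $f^* \in \mathcal{F}$, then

$$E[R(\hat{f}_n)] \le R(f^*) + \sqrt{\frac{c(f^*)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}} \qquad (12.6)$$

Furthermore if $c(f^*) = O(\log n)$ then

$$E[R(\hat{f}_n)] \le R(f^*) + O\left(\sqrt{\frac{\log n}{n}}\right) \qquad (12.7)$$

that is, only within a small $O\left(\sqrt{\frac{\log n}{n}}\right)$ offset of the optimal risk.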



In general, we can also bound the excess risk $E[R(\hat{f}_n)] - R^*$, where $R^*$ is the Bayes risk,

$$R^* = \inf_{f \text{ measurable}} R(f). \qquad (12.8)$$

By subtracting $R^*$ (a constant) from both sides of the inequality

$$E[R(\hat{f}_n)] \le \min_{f \in \mathcal{F}} \left\{R(f) + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}\right\} \qquad (12.9)$$

we obtain

$$E[R(\hat{f}_n)] - R^* \le \min_{f \in \mathcal{F}} \left\{R(f) - R^* + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}\right\}. \qquad (12.10)$$

Note that there are two terms in this upper bound: $R(f) - R^*$ is the approximation error
of a model $f$, and the remainder is a bound on the estimation error associated with $f$. Thus, we
see that complexity regularization automatically optimizes a balance between approximation and
estimation errors. In other words, complexity regularization is adaptive to the unknown tradeoff
between approximation and estimation.



12.2 Classification 

Consider the particularization of the above to a classification scenario. Let $\mathcal{X} = [0, 1]^d$, $\mathcal{Y} = \{0, 1\}$ and
$\ell(\hat{y}, y) = 1_{\{\hat{y} \ne y\}}$. Then $R(f) = E_{XY}[1_{\{f(X) \ne Y\}}] = P(f(X) \ne Y)$. The Bayes risk is given by

$$R^* = \inf_{f \text{ measurable}} R(f). \qquad (12.11)$$

As was observed before, the Bayes classifier (i.e., a classifier that achieves the Bayes risk) is given by

$$f^*(x) = \begin{cases} 1, & P(Y = 1 \mid X = x) \ge \frac{1}{2} \\ 0, & P(Y = 1 \mid X = x) < \frac{1}{2} \end{cases} \qquad (12.12)$$



This classifier can be expressed in a different way. Consider the set $G^* = \{x : P(Y = 1 \mid X = x) \ge 1/2\}$.
The Bayes classifier can be written as $f^*(x) = 1_{\{x \in G^*\}}$. Therefore the classifier is characterized entirely by the
set $G^*$: if $X \in G^*$ then the "best" guess is that $Y$ is one, and vice versa. The boundary of this set corresponds
to the points where the decision is hardest. The boundary of $G^*$ is called the Bayes Decision Boundary.
Figure 12.1(a) illustrates this concept. If $\eta(x) = P(Y = 1 \mid X = x)$ is a continuous function then the
Bayes decision boundary is simply given by $\{x : \eta(x) = 1/2\}$. Clearly the structure of the
decision boundary provides important information on the difficulty of the problem.



Figure 12.1: (a) The Bayes classifier and the Bayes decision boundary; (b) Example of the i.i.d.
training pairs.

12.2.1 Empirical Classifier Design

Given $n$ i.i.d. training pairs, $\{X_i, Y_i\}_{i=1}^{n}$, we want to construct a classifier $\hat{f}_n$ that performs well on average,
i.e., we want $E[R(\hat{f}_n)]$ as close to $R^*$ as possible. In Figure 12.1(b) an example of the i.i.d. training
pairs is depicted.

The construction of a classifier boils down to the estimation of the Bayes decision boundary. The 
histogram rule, discussed in a previous lecture, approaches the problem by subdividing the feature space 
into small boxes and taking a majority vote of the training data in each box. A typical result is depicted in 
Figure 12.2(a). 

The main problem with the histogram rule is that it is solving a more complicated problem than is
actually necessary. We do not need to determine the correct label for each individual box directly (the
histogram rule is essentially estimating $\eta(x)$). In principle we only need to locate the decision boundary and
assign the correct label on either side (notice that the accuracy of a majority vote over a region increases
with the size of the region). The next example illustrates this.

Example 12.2: Three Different Classifiers 

The pictures below correspond to the approximation of the Bayes classifier by three different
classifiers:

Figure 12.2: (a) Histogram classifier; (b) Linear classifier; (c) Tree classifier.



The linear classifier and the tree classifier (to be defined formally later) both attack the problem
of finding the boundary more directly than the histogram classifier, and therefore they tend to
produce much better results in theory and practice. In the following we will demonstrate this for
classification trees.

12.3 Binary Classification Trees

Binary classification trees are constructed by a two-step process:

1. Tree growing

2. Tree pruning

The basic idea is to first grow a very large, complicated tree classifier that explains the training data
very accurately but has poor generalization characteristics, and then prune this tree to avoid overfitting.

12.3.1 Growing Trees 

The growing process is based on recursively subdividing the feature space. Usually the subdivisions are
splits of existing regions into two smaller regions (i.e., binary splits) and usually the splits are perpendicular
to one of the feature axes. An example of such a construction is depicted in Figure 12.3.

Figure 12.3: Growing a recursive binary tree ($\mathcal{X} = [0, 1]^2$)

Often the splitting process is based on the training data, and is designed to separate data with different
labels as much as possible. In such constructions, the "splits," and hence the tree structure itself, are data
dependent. Alternatively, the splitting and subdivision could be independent from the training data. The
latter approach is the one we are going to investigate in detail, and we will consider Dyadic Decision Trees
and Recursive Dyadic Partitions (depicted in Figure 12.4) in particular.

Until now we have been referring to trees, but did not make clear how trees relate to partitions. It
turns out that any decision tree can be associated with a partition of the input space $\mathcal{X}$ and vice versa. In
particular, a Recursive Dyadic Partition (RDP) can be associated with a (binary) tree. In fact, this is the
most efficient way of describing an RDP. In Figure 12.4 we illustrate the procedure. Each leaf of the tree
corresponds to a cell of the partition. The nodes in the tree correspond to the various partition cells that
are generated in the construction of the tree. The orientation of the dyadic split alternates between
the levels of the tree (for the example of Figure 12.4, at the root level the split is done in the horizontal axis,
at the level below that (the level of nodes 2 and 3) the split is done in the vertical axis, and so on). The
tree is called dyadic because the splits of cells are always at the midpoint along one coordinate axis, and
consequently the sidelengths of all cells are dyadic (i.e., negative integer powers of 2).

Figure 12.4: Example of Recursive Dyadic Partition (RDP) growing ($\mathcal{X} = [0, 1]^2$)

In the following we are going to consider the 2-dimensional case, but all the results can be easily generalized
to the $d$-dimensional case ($d \ge 2$), provided the dyadic tree construction is defined properly. Consider
a recursive dyadic partition of the feature space into $k$ boxes of equal size. Associated with this partition
is a tree $T$. Minimizing the empirical risk with respect to this partition produces the histogram classifier
with $k$ equal-sized bins. Consider also all the possible partitions corresponding to pruned versions of the tree
$T$. Minimizing the empirical risk with respect to those other partitions results in other classifiers (dyadic
decision trees) that are fundamentally different from the histogram rule we analyzed earlier.

12.3.2 Pruning 

Let $\mathcal{F}$ be the collection of all possible dyadic decision trees corresponding to recursive dyadic partitions of
the feature space. Each such tree can be prefix encoded with a bit-string whose length is proportional to the number of leafs
in the tree as follows: encode the structure of the tree in a top-down fashion: (i) assign a zero at each branch
node and a one at each leaf node (terminal node); (ii) read the code in a breadth-first fashion, top-down,
left-right. Figure 12.5 exemplifies this coding strategy. Notice that, since we are considering binary trees,
the total number of nodes is twice the number of leafs minus one, that is, if the number of leafs in the tree
is $k$ then the number of nodes is $2k - 1$. Therefore to encode a tree with $k$ leafs we need $2k - 1$ bits.

Since we want to use the partition associated with this tree for classification we need to assign a decision
label (either zero or one) to each leaf. Hence, to encode a decision tree in this fashion we need $3k - 1$ bits,
where $k$ is the number of leafs. For a tree with $k$ leafs the first $2k - 1$ bits of the codeword encode the tree
structure, and the remaining $k$ bits encode the classification labels. This is easily shown to be a prefix code,
therefore we can use it in our classification scenario.

Figure 12.5: Illustration of the tree coding technique: example of a tree and corresponding prefix code
(structure bits 000111011).
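
The coding scheme can be sketched in a few lines. The representation below (a leaf is an integer label, a branch node is a pair of subtrees), the function name `encode_tree`, and the particular tree are assumptions made for illustration; the tree was chosen so that its structure bits are 000111011, but it is not claimed to be the tree drawn in the figure.

```python
from collections import deque

def encode_tree(tree):
    """Breadth-first prefix code of a binary decision tree.

    A leaf is an int label (0 or 1); a branch node is a pair (left, right).
    Structure bits: 0 per branch node, 1 per leaf; then one label bit per leaf.
    """
    structure, labels = [], []
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, tuple):     # branch node
            structure.append("0")
            queue.extend(node)          # visit children left to right
        else:                           # leaf node
            structure.append("1")
            labels.append(str(node))
    return "".join(structure), "".join(labels)

# A hypothetical tree with k = 5 leaves.
tree = ((0, 1), (1, (0, 0)))
structure, labels = encode_tree(tree)
print(structure, labels)
print(len(structure) + len(labels))     # 3k - 1 bits in total
```

Here the label bits are emitted in the same breadth-first order as the leaves; any fixed order known to the decoder would serve equally well.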



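Let

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{\hat{R}_n(f) + \sqrt{\frac{(3k - 1)\log 2 + \frac{1}{2}\log n}{2n}}\right\}, \qquad (12.13)$$

where $k$ denotes the number of leafs of the tree $f$.
This optimization can be solved through a bottom-up pruning process (starting from a very large initial tree
$T_0$) in $O(|T_0|^2)$ operations, where $|T_0|$ is the number of leafs in the initial tree. The complexity regularization
theorem tells us that

$$E[R(\hat{f}_n)] \le \min_{f \in \mathcal{F}} \left\{R(f) + \sqrt{\frac{(3k - 1)\log 2 + \frac{1}{2}\log n}{2n}}\right\} + \frac{1}{\sqrt{n}}. \qquad (12.14)$$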



12.4 Comparison between Histogram Classifiers and Classification 
Trees 

In the following we will illustrate the idea behind complexity regularization by applying the basic theorem 
to histogram classifiers and classification trees (using our setup above). 

Consider the classification setup described in "Classification" (Section 12.2: Classification), with $\mathcal{X} = [0, 1]^2$.



12.4.1 Histogram Risk Bound 

Recall the setup and results of a previous lecture 2 . Let

$$\mathcal{F}_k^H = \{\text{histogram rules with } k^2 \text{ bins}\}. \qquad (12.15)$$

Then $|\mathcal{F}_k^H| = 2^{k^2}$. Let $\mathcal{F}^H = \bigcup_{k \ge 1} \mathcal{F}_k^H$. We can encode each element $f$ of $\mathcal{F}^H$ with $c_H(f) = k + k^2$ bits,
where the first $k$ bits indicate the smallest $k$ such that $f \in \mathcal{F}_k^H$ and the following $k^2$ bits encode the labels
of each bin. This is a prefix encoding of all the elements in $\mathcal{F}^H$.

2 The description here is slightly different than the one in the previous lecture.



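We define our estimator as

$$\hat{f}_n^H = \hat{f}_n^{(\hat{k})}, \qquad (12.16)$$

where

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k^H} \hat{R}_n(f), \qquad (12.17)$$

and

$$\hat{k} = \arg\min_{k \ge 1} \left\{\hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{(k + k^2)\log 2 + \frac{1}{2}\log n}{2n}}\right\}. \qquad (12.18)$$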



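Therefore $\hat{f}_n^H$ minimizes

$$\hat{R}_n(f) + \sqrt{\frac{c_H(f)\log 2 + \frac{1}{2}\log n}{2n}} \qquad (12.19)$$

over all $f \in \mathcal{F}^H$. We showed before that

$$E[R(\hat{f}_n^H)] - R^* \le \min_{f \in \mathcal{F}^H} \left\{R(f) - R^* + \sqrt{\frac{c_H(f)\log 2 + \frac{1}{2}\log n}{2n}}\right\} + \frac{1}{\sqrt{n}}. \qquad (12.20)$$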



To proceed with our analysis we need to make some assumptions on the intrinsic difficulty of the problem.
We will assume that the Bayes decision boundary is a "well-behaved" 1-dimensional set, in the sense that
it has box-counting dimension one (see Appendix "Box Counting Dimension" (Section 12.6: Box Counting
Dimension)). This implies that, for a histogram with $k^2$ bins, the Bayes decision boundary intersects fewer
than $Ck$ bins, where $C$ is a constant that does not depend on $k$. Furthermore we assume that the marginal
distribution of $X$ satisfies $P_X(A) \le K|A|$, for any measurable subset $A \subseteq [0, 1]^2$. This means that the
samples collected do not accumulate anywhere in the unit square.
Under the above assumptions, the best histogram rule can err only on the bins that meet the boundary,
each of which has probability at most $K/k^2$, so we can conclude that

$$\min_{f \in \mathcal{F}_k^H} R(f) - R^* \le \frac{K}{k^2}\, Ck = \frac{CK}{k}. \qquad (12.21)$$

Therefore

$$E[R(\hat{f}_n^H)] - R^* \le \frac{CK}{k} + \sqrt{\frac{(k + k^2)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}. \qquad (12.22)$$



We can balance the terms on the right side of the above expression using $k = n^{1/4}$ (for $n$ large), therefore

$$E[R(\hat{f}_n^H)] - R^* = O(n^{-1/4}), \quad \text{as } n \to \infty. \qquad (12.23)$$

12.4.2 Dyadic Decision Trees

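Now let's consider the dyadic decision trees, under the assumptions above, and contrast these with the
histogram classifier. Let

$$\mathcal{F}_k^T = \{\text{tree classifiers with } k \text{ leafs}\}. \qquad (12.24)$$

Let $\mathcal{F}^T = \bigcup_{k \ge 1} \mathcal{F}_k^T$. We can prefix encode each element $f$ of $\mathcal{F}^T$ with $c_T(f) = 3k - 1$ bits, as described
before.
Let

$$\hat{f}_n^T = \hat{f}_n^{(\hat{k})}, \qquad (12.25)$$

where

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k^T} \hat{R}_n(f), \qquad (12.26)$$

and

$$\hat{k} = \arg\min_{k \ge 1} \left\{\hat{R}_n(\hat{f}_n^{(k)}) + \sqrt{\frac{(3k - 1)\log 2 + \frac{1}{2}\log n}{2n}}\right\}. \qquad (12.27)$$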


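Hence $\hat{f}_n^T$ minimizes

$$\hat{R}_n(f) + \sqrt{\frac{c_T(f)\log 2 + \frac{1}{2}\log n}{2n}} \qquad (12.28)$$

over all $f \in \mathcal{F}^T$. Moreover

$$E[R(\hat{f}_n^T)] - R^* \le \min_{f \in \mathcal{F}^T} \left\{R(f) - R^* + \sqrt{\frac{c_T(f)\log 2 + \frac{1}{2}\log n}{2n}}\right\} + \frac{1}{\sqrt{n}}. \qquad (12.29)$$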



If the Bayes decision boundary is a 1-dimensional set, as in "Histogram Risk Bound" (Section 12.4.1:
Histogram Risk Bound), there exists a tree with at most $8Ck$ leafs such that the boundary is contained in
at most $Ck$ squares, each of volume $1/k^2$. To see this, start with a tree yielding the histogram partition
with $k^2$ boxes (i.e., the tree partitioning the unit square into $k^2$ equal sized squares). Now prune all the
nodes that do not intersect the boundary. In Figure 12.6 we illustrate the procedure. If you carefully bound
the number of leafs you need at each level you can show that you will have in total fewer than $8Ck$ leafs. We
conclude then that there exists a tree with at most $8Ck$ leafs that has the same risk as a histogram with
$O(k^2)$ bins. Therefore, using (12.14) we have



$$E[R(\hat{f}_n^T)] - R^* \le \frac{CK}{k} + \sqrt{\frac{(3(8Ck) - 1)\log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}. \qquad (12.30)$$

We can balance the terms on the right side of the above expression using $k = n^{1/3}$ (for $n$ large), therefore

$$E[R(\hat{f}_n^T)] - R^* = O(n^{-1/3}), \quad \text{as } n \to \infty. \qquad (12.31)$$

Figure 12.6: Illustration of the tree pruning procedure: (a) Histogram classification rule, for a partition
with 16 bins, and corresponding binary tree representation (with 16 leafs), (b) Pruned version of the
histogram tree, yielding exactly the same classification rule, but now requiring only 6 leafs. (Note: The
trees were constructed using the procedure of Figure )
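
The balancing argument behind the two rates can also be checked numerically by minimizing the right sides of (12.22) and (12.30) over $k$. The sketch below assumes $C = K = 1$ purely for illustration; the function names are hypothetical.

```python
import math

def histogram_bound(k, n, C=1.0, K=1.0):
    """Right side of (12.22): CK/k + sqrt(((k + k^2) log 2 + 0.5 log n)/(2n)) + 1/sqrt(n)."""
    est = math.sqrt(((k + k * k) * math.log(2) + 0.5 * math.log(n)) / (2.0 * n))
    return C * K / k + est + 1.0 / math.sqrt(n)

def tree_bound(k, n, C=1.0, K=1.0):
    """Right side of (12.30): CK/k + sqrt(((3(8Ck) - 1) log 2 + 0.5 log n)/(2n)) + 1/sqrt(n)."""
    est = math.sqrt(((3 * 8 * C * k - 1) * math.log(2) + 0.5 * math.log(n)) / (2.0 * n))
    return C * K / k + est + 1.0 / math.sqrt(n)

def best_over_k(bound, n, max_k=2000):
    """Smallest value of the bound over k = 1..max_k, paired with the minimizing k."""
    return min((bound(k, n), k) for k in range(1, max_k + 1))

n = 10 ** 6
print(best_over_k(histogram_bound, n))   # minimizing k is on the order of n^(1/4)
print(best_over_k(tree_bound, n))        # minimizing k is on the order of n^(1/3)
```

For this $n$ the tree bound is the smaller of the two, and it is attained at a larger $k$, in line with the $n^{-1/3}$ versus $n^{-1/4}$ rates.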



12.5 Final Comments 

Trees generally work much better than histogram classifiers. This is essentially because they provide much
more efficient ways of approximating the Bayes decision boundary (as we saw in our example, under reasonable
assumptions on the Bayes boundary, a tree encoded with $O(k)$ bits can describe the same classifier as
a histogram that requires $O(k^2)$ bits).

The dyadic decision trees studied here are different than classical tree rules, such as CART or C4.5. 
Those techniques select a tree according to 



$$\hat{k} = \arg\min_{k \ge 1} \left\{\hat{R}_n(\hat{f}_n^{(k)}) + \alpha k\right\}, \qquad (12.32)$$

for some $\alpha > 0$, whereas ours was roughly

$$\hat{k} = \arg\min_{k \ge 1} \left\{\hat{R}_n(\hat{f}_n^{(k)}) + \alpha\sqrt{k}\right\}, \qquad (12.33)$$

for $\alpha \approx \sqrt{\frac{3\log 2}{2n}}$. The square root penalty is essential for the risk bound. No such bound exists for CART
or C4.5. Moreover, recent experimental work has shown that the square root penalty often performs better
in practice. Finally, recent results show that a slightly tighter bounding procedure for the estimation error
can be used to show that dyadic decision trees (with a slightly different pruning procedure) achieve a rate of

$$E[R(\hat{f}_n^T)] - R^* = O(n^{-1/2}), \quad \text{as } n \to \infty, \qquad (12.34)$$



which turns out to be the minimax optimal rate (i.e., under the boundary assumptions above, no method 
can achieve a faster rate of convergence to the Bayes error). 

12.6 Box Counting Dimension 

The notion of the dimension of a set arises in many aspects of mathematics, and it is particularly relevant to
the study of fractals (that besides some important applications make really cool t-shirts). The dimension
somehow indicates how we should measure the contents of a set (length, area, volume, etc.). The box-counting
dimension is a simple definition of the dimension of a set. The main idea is to cover the set
with boxes with sidelength $r$. Let $N(r)$ denote the smallest number of such boxes; then the box-counting
dimension is defined as

$$\lim_{r \to 0} \frac{\log N(r)}{-\log r}. \qquad (12.35)$$

Although the boxes considered above do not need to be aligned on a rectangular grid (and can in fact 
overlap) we can usually consider them over a grid and obtain an upper bound on the box-counting dimension. 
To illustrate the main ideas let's consider a simple example, and connect it to the classification scenario 
considered before. 

Let $f : [0,1] \to [0,1]$ be a Lipschitz function, with Lipschitz constant $L$ (i.e., $|f(a) - f(b)| \le L|a - b|$,
$\forall a, b \in [0,1]$). Define the set

$$A = \{x = (x_1, x_2) : x_2 = f(x_1)\}, \qquad (12.36)$$

that is, the set $A$ is the graph of the function $f$.

Consider a partition with $k^2$ square boxes (just like the ones we used in the histograms). The points in
set $A$ intersect at most $C'k$ boxes, with $C' = (1 + \lceil L \rceil)$ (and also the number of intersected boxes is at least
$k$). The sidelength of the boxes is $1/k$, therefore the box-counting dimension of $A$ satisfies

$$\dim_B(A) \le \lim_{k \to \infty} \frac{\log(C'k)}{-\log(1/k)} = \lim_{k \to \infty} \frac{\log C' + \log k}{\log k} = 1. \qquad (12.37)$$
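
The box count for the graph of a Lipschitz function can be approximated on a computer. The sketch below samples each grid column densely, so it can only undercount the intersected boxes; the test function, the sampling density, and the name `count_boxes` are assumptions made for illustration.

```python
import math

def count_boxes(f, k, samples_per_column=200):
    """Approximate number of cells of a k-by-k grid on [0,1]^2 met by the graph of f.

    Each column is sampled densely, so the result can only undercount.
    """
    total = 0
    for j in range(k):
        rows = set()
        for s in range(samples_per_column + 1):
            x = (j + s / samples_per_column) / k
            rows.add(min(int(f(x) * k), k - 1))   # row index of the point (x, f(x))
        total += len(rows)
    return total

def f(x):
    """Hypothetical Lipschitz function with values in [0, 1]."""
    return 0.5 + 0.3 * math.sin(2 * math.pi * x)

L = 0.6 * math.pi                      # Lipschitz constant of f
for k in (16, 64, 256):
    N = count_boxes(f, k)
    print(k, N, math.log(N) / math.log(k))
```

The count stays between $k$ and $(1 + \lceil L \rceil)k$, so the ratio $\log N / \log k$ approaches 1 as $k$ grows.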

The result above will hold for any "normal" set $A \subseteq [0, 1]^2$ that does not occupy any area. For most sets the
box-counting dimension is going to be an integer, but for some "weird" sets (called fractal sets) it is
not an integer. For example, the Koch curve has box-counting dimension $\log(4)/\log(3) = 1.26186\ldots$. This
means that it is not quite as small as a 1-dimensional curve, but not as big as a 2-dimensional set (hence
occupies no area).

To connect these concepts to our classification scenario consider a simple example. Let $\eta(x) =
P(Y = 1 \mid X = x)$ and assume $\eta(x)$ has the form

$$\eta(x) = \frac{1}{2} + x_2 - f(x_1), \quad \forall x = (x_1, x_2) \in \mathcal{X}, \qquad (12.38)$$

where $f : [0, 1] \to [0, 1]$ is Lipschitz with Lipschitz constant $L$. The Bayes classifier is then given by

$$f^*(x) = 1_{\{\eta(x) \ge 1/2\}} = 1_{\{x_2 \ge f(x_1)\}}. \qquad (12.39)$$

This is depicted in Figure 12.7. Note that this is a special, restricted class of problems. That is,
we are considering the subset of all classification problems such that the joint distribution $P_{XY}$ satisfies
$P(Y = 1 \mid X = x) = 1/2 + x_2 - f(x_1)$ for some function $f$ that is Lipschitz. The Bayes decision boundary is
therefore given by



$$A = \{x = (x_1, x_2) : x_2 = f(x_1)\}. \qquad (12.40)$$

As we observed before, this set has box-counting dimension 1.

Figure 12.7: Bayes decision boundary for the setup described in Appendix

Chapter 13

Complexity Regularization for Squared 
Error Loss 1 

13.1 Complexity Regularization in Regression 

Recall the classification problem. In Lecture 6 (Chapter 7), where we assumed that $\min_{f \in \mathcal{F}} R(f) = 0$, we
obtained the PAC bound

$$P(R(\hat{f}_n) > \epsilon) \le |\mathcal{F}| e^{-n\epsilon} \qquad (13.1)$$

From Corollary 1 in Lecture 6 (Corollary 7.1, p. 47),

$$E[R(\hat{f}_n)] \le \frac{1 + \log|\mathcal{F}|}{n} \qquad (13.2)$$



In Lectures 7 (Chapter 8) and 8 (Chapter 9), we dropped the assumption that $\min_{f \in \mathcal{F}} R(f) = 0$ and
obtained, $\forall f \in \mathcal{F}$



v{r /„>£}< in 



-2ne z 



This led to

$$E[R(\hat{f}_n)] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log|\mathcal{F}| + \log n + 2}{n}}$$



(13.3) 



(13.4) 



Hoeffding's inequality was central to our analysis of learning under bounded loss functions. In many 
regression and signal estimation problems it is natural to consider squared error loss functions (rather than 
0/1 or absolute error). In such cases, we will need to derive bounds using different techniques. 

Example 13.1 

To illustrate the distinction between classification and regression, consider a simple, scalar signal
plus noise problem. Consider $Y_i = \theta + W_i$, $i = 1, \ldots, n$, where $\theta$ is a fixed unknown scalar parameter
and the $W_i$ are independent, zero-mean, unit variance random variables. Let $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
and $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i$.
Then, according to the Central Limit Theorem, $\bar{Y}$ is distributed approximately $N(\theta, 1/n)$. A simple
tail-bound on the Gaussian distribution gives us

$$P(\bar{Y} - \theta > \epsilon) = P(\bar{W} > \epsilon) \le \frac{1}{2} e^{-n\epsilon^2/2} \qquad (13.5)$$

which implies that

$$P(|\bar{Y} - \theta|^2 > \epsilon^2) \le e^{-n\epsilon^2/2} \qquad (13.6)$$

This is a bound on the deviations of the squared error $\text{err}^2 = |\bar{Y} - \theta|^2$: writing $t = \epsilon^2$, it reads
$P(\text{err}^2 > t) \le e^{-nt/2}$. Notice that the exponent is linear in the deviation $t$ of the squared error, rather
than quadratic as in Hoeffding's inequality. The squared error concentration
inequality implies that $E[|\bar{Y} - \theta|^2] = O\left(\frac{1}{n}\right)$ (just write $E[\text{err}^2] = \int_0^\infty P(\text{err}^2 > t)\, dt$).
Therefore, in regression with a squared error loss, we can hope to get a rate of convergence as fast
as $n^{-1}$ instead of $n^{-1/2}$. The reason is simply that we are using a squared error loss instead
of the 0/1 or absolute error loss.

1 This content is available online at <http://cnx.org/content/m16267/1.2/>.
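
The $1/n$ behavior of the squared error of the sample mean can be observed with a small Monte Carlo experiment. The sketch below assumes Gaussian noise (the example only requires zero mean and unit variance), and the function name, seed, and trial count are arbitrary illustrative choices.

```python
import random

def mse_of_sample_mean(n, theta=2.0, trials=2000, seed=0):
    """Monte Carlo estimate of E[(Ybar - theta)^2] for Y_i = theta + W_i, W_i ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        ybar = sum(theta + rng.gauss(0.0, 1.0) for _ in range(n)) / n
        total += (ybar - theta) ** 2
    return total / trials

for n in (10, 100, 400):
    est = mse_of_sample_mean(n)
    print(n, est, n * est)       # n * est should stay close to 1
```

The product $n \cdot \widehat{\text{MSE}}$ remaining near 1 across sample sizes reflects the $n^{-1}$ rate, as opposed to the $n^{-1/2}$ rate typical of bounded-loss bounds.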

To begin our investigation into regression and function estimation, let us consider the following.
Let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}$. Take $\mathcal{F}$ such that $f \in \mathcal{F}$ is a map $f : \mathbb{R}^d \to \mathbb{R}$. We have training data
$\{X_i, Y_i\}_{i=1}^{n} \overset{iid}{\sim} P_{XY}$. As our loss function, we take the squared error, i.e.,

$$\ell(f(X_i), Y_i) = (f(X_i) - Y_i)^2 \qquad (13.7)$$

The risk is then the MSE:

$$R(f) = E[(f(X) - Y)^2] \qquad (13.8)$$

We know that the function $f^*$ that minimizes the MSE is just the conditional expectation of $Y$
given $X$:

$$f^*(x) = E[Y \mid X = x]. \qquad (13.9)$$



Now let $R^* = R(f^*)$. We would like to select an $\hat{f}_n \in \mathcal{F}$ using the training data $\{X_i, Y_i\}_{i=1}^{n}$ such
that the excess risk

$$E[R(\hat{f}_n)] - R^* \ge 0 \qquad (13.10)$$

is small. Let's consider the difference between the empirical risks:

$$\hat{R}_n(f) - \hat{R}_n(f^*) = \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2 - \frac{1}{n}\sum_{i=1}^{n} (f^*(X_i) - Y_i)^2 \qquad (13.11)$$

Note that $E[\hat{R}_n(f) - \hat{R}_n(f^*)] = R(f) - R(f^*)$. Hence, by the Strong Law of Large Numbers
(SLLN), we know that

$$\hat{R}_n(f) - \hat{R}_n(f^*) \to R(f) - R(f^*) \qquad (13.12)$$

as $n \to \infty$. But how fast is this convergence?

We will derive a PAC style bound for the difference $R(f) - R(f^*) - (\hat{R}_n(f) - \hat{R}_n(f^*))$. The
following derivation is from Barron 1991. The excess risk and its empirical counterpart will be
denoted by

$$r(f, f^*) = R(f) - R(f^*), \qquad \hat{r}(f, f^*) = \hat{R}_n(f) - \hat{R}_n(f^*) \qquad (13.13)$$

Note that $\hat{r}(f, f^*)$ is the sum of independent random variables:

$$\hat{r}(f, f^*) = -\frac{1}{n}\sum_{i=1}^{n} U_i \qquad (13.14)$$

where $U_i = -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2$. Therefore, $r(f, f^*) - \hat{r}(f, f^*) = \frac{1}{n}\sum_{i=1}^{n} (U_i - E[U_i])$.
We are looking for a PAC bound of the form

$$P(r(f, f^*) - \hat{r}(f, f^*) > \epsilon) < \delta. \qquad (13.15)$$



If the variables $U_i$ are bounded, then we can apply Hoeffding's inequality. However, a more useful
bound for our regression problem can be derived if the variables $U_i$ satisfy the following moment
condition:

$$E\left[|U_i - E[U_i]|^k\right] \le \frac{\text{var}(U_i)}{2}\, k!\, h^{k-2} \qquad (13.16)$$

for some $h > 0$ and all integers $k \ge 2$.

The moment condition can be difficult to verify in general, but it does hold, for example, for
bounded random variables. If (13.16) holds, then the Craig-Bernstein (CB) inequality states:

$$P\left(\frac{1}{n}\sum_{i=1}^{n} (U_i - E[U_i]) \ge \frac{t}{n\epsilon} + \frac{n\epsilon\, \text{var}\left(\frac{1}{n}\sum_{i=1}^{n} U_i\right)}{2(1 - c)}\right) \le e^{-t} \qquad (13.17)$$

for $0 < \epsilon h \le c < 1$ and $t > 0$. This shows that the tail decays exponentially in $t$, rather than
exponentially in $t^2$. Recall Hoeffding's inequality:

$$P\left(\frac{1}{n}\sum_{i=1}^{n} (Z_i - E[Z_i]) \ge \frac{t}{n}\right) \le e^{-2t^2/n}. \qquad (13.18)$$

If $\frac{t}{n} \ll 1$, then $\frac{t^2}{n} \ll t$, which implies $e^{-2t^2/n} \gg e^{-t}$. This indicates that the CB inequality may
be much tighter than Hoeffding's. To use the CB inequality, we need to bound the variance of $U_i$:

$$\text{var}(U_i) = \text{var}\left(-(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2\right). \qquad (13.19)$$
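
The difference between the two tail exponents is easy to see numerically. The sketch below only compares the right-hand sides $e^{-2t^2/n}$ and $e^{-t}$; it does not account for the fact that the two inequalities control deviations of different sizes ($t/n$ versus $t/(n\epsilon)$ plus a variance term), so it illustrates the decay rates only. The function names and the value of $n$ are illustrative assumptions.

```python
import math

def hoeffding_tail(t, n):
    """Right side exp(-2 t^2 / n) of Hoeffding's inequality in the form (13.18)."""
    return math.exp(-2.0 * t * t / n)

def craig_bernstein_tail(t):
    """Right side exp(-t) of the Craig-Bernstein inequality (13.17)."""
    return math.exp(-t)

n = 10000
for t in (1.0, 5.0, 20.0):
    print(t, hoeffding_tail(t, n), craig_bernstein_tail(t))
```

When $t$ is much smaller than $n$, the Hoeffding right-hand side stays close to 1 while $e^{-t}$ is already small.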

Assumption 1

The support of $Y$ and the range of $f(X)$ are in a known interval of length $b$.

Proposition 1

With the above assumption, (13.16) holds with $h = \frac{2b^2}{3}$.

Proposition 2

Again, with the above assumption, it may be shown that

$$\text{var}(U_i) \le 5b^2\, r(f, f^*). \qquad (13.20)$$

You can write $U_i$ as

$$U_i = 2Y_i f(X_i) - 2Y_i f^*(X_i) + f^*(X_i)^2 - f(X_i)^2$$
$$= 2Y_i f(X_i) - 2Y_i f^*(X_i) + 2f^*(X_i)^2 - f^*(X_i)^2 - f(X_i)^2 + 2f(X_i)f^*(X_i) - 2f(X_i)f^*(X_i)$$
$$= 2(Y_i - f^*(X_i))(f(X_i) - f^*(X_i)) - (f(X_i) - f^*(X_i))^2. \qquad (13.21)$$

Write $T_1 = 2(Y_i - f^*(X_i))(f(X_i) - f^*(X_i))$ and $T_2 = (f(X_i) - f^*(X_i))^2$, so that $U_i = T_1 - T_2$.
Note that the variance of $U_i$ is upper-bounded by its second moment. Also note that the covariance of
the two terms above is zero:

$$E\left[2(Y_i - f^*(X_i))(f(X_i) - f^*(X_i))(f(X_i) - f^*(X_i))^2\right] = E[T_1 T_2] = E_X\left[E_{Y|X}[T_1 T_2]\right] = E_X\left[T_2\, E_{Y|X}[T_1]\right] = E_X[T_2 \times 0] = 0. \qquad (13.22)$$

This is evident when you recall that $f^*(X_i) = E[Y \mid X = X_i]$. Now we can bound the second moments of
$T_1$ and $T_2$:

$$E[T_1^2] = 4E\left[\left((Y_i - f^*(X_i))(f(X_i) - f^*(X_i))\right)^2\right] = 4E\left[(Y_i - f^*(X_i))^2 (f(X_i) - f^*(X_i))^2\right] \le 4b^2 E\left[(f(X_i) - f^*(X_i))^2\right]$$

$$E[T_2^2] = E\left[(f(X_i) - f^*(X_i))^4\right] = E\left[(f(X_i) - f^*(X_i))^2 (f(X_i) - f^*(X_i))^2\right] \le b^2 E\left[(f(X_i) - f^*(X_i))^2\right] \qquad (13.23)$$

So $\text{var}(U_i) \le 5b^2 E\left[(f(X_i) - f^*(X_i))^2\right]$. The final step is to see that

$$r(f, f^*) = -E[U_i] = -E_X\left[E_{Y|X}[U_i]\right] = E\left[(f(X_i) - f^*(X_i))^2\right]. \qquad (13.24)$$

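Thus, $n\, \text{var}\left(\frac{1}{n}\sum_{i=1}^{n} U_i\right) \le 5b^2\, r(f, f^*)$. And therefore, we can say that, with probability at least $1 - e^{-t}$,

$$r(f, f^*) - \hat{r}(f, f^*) \le \frac{t}{n\epsilon} + \frac{5\epsilon b^2\, r(f, f^*)}{2(1 - c)}. \qquad (13.25)$$

In other words, with probability at least $1 - \delta$ (where $\delta = e^{-t}$),

$$r(f, f^*) - \hat{r}(f, f^*) \le \frac{\log(1/\delta)}{n\epsilon} + \frac{5\epsilon b^2\, r(f, f^*)}{2(1 - c)}. \qquad (13.26)$$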

Now, suppose we have assigned positive numbers c(f) to each / € T satisfying the Kraft inequality: 

Y, 2~ c(/) < 1. (13.27) 
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Note that (13.26) holds \/6 > 0. In particular, we let 5 be a function of f: 

5{f) = 2- c( -^S. (13.28) 

So we can use this 5 along with the procedure introduced in Lecture 9 (Chapter 10) (i.e., Union of events 
bound followed by the Kraft inequality) to obtain the following. For all / € !F,\/5 > 0, 

r(/.r)-M/,r)< C(/) "' g2 + '^ + fe f f ; (/ f ) (.3.29) 

n e 2 (1 — c) 

with probability at least $1 - \delta$. Now set $c = \epsilon h = 2b^2\epsilon/3$ and assume $\epsilon < \frac{6}{19 b^2}$. Then define

$$\alpha = \frac{5\epsilon b^2}{2(1 - c)} < 1. \qquad (13.30)$$

Now, after using $\alpha$ and rearranging terms, we have:

$$(1 - \alpha)\, r(f, f^*) \le \hat{r}_n(f, f^*) + \frac{c(f)\log 2 + \log(1/\delta)}{n\epsilon}. \qquad (13.31)$$



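We want to choose $f$ to minimize this upper bound. So take

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \Big\{ \hat{R}_n(f) + \frac{c(f)\log 2}{n\epsilon} \Big\}. \qquad (13.32)$$

So, with probability at least $1 - \delta$,

$$(1 - \alpha)\, r(\hat{f}_n, f^*) \le \hat{r}_n(\hat{f}_n, f^*) + \frac{c(\hat{f}_n)\log 2 + \log(1/\delta)}{n\epsilon}$$
$$\le \hat{r}_n(f_n^*, f^*) + \frac{c(f_n^*)\log 2 + \log(1/\delta)}{n\epsilon}, \qquad (13.33)$$

where $f_n^* = \arg\min_{f \in \mathcal{F}} \big\{ R(f) + \frac{c(f)\log 2}{n\epsilon} \big\}$.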



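Now we use the Craig-Bernstein inequality to bound the difference between $\hat{r}_n(f_n^*, f^*)$ and $r(f_n^*, f^*)$: With probability at least $1 - \delta$,

$$\hat{r}_n(f_n^*, f^*) \le r(f_n^*, f^*) + \alpha\, r(f_n^*, f^*) + \frac{\log(1/\delta)}{n\epsilon}. \qquad (13.34)$$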

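Now we can again use the union bound to combine (13.33) and (13.34): With probability at least $1 - 2\delta$, $\forall \delta > 0$,

$$r(\hat{f}_n, f^*) \le \frac{1 + \alpha}{1 - \alpha}\, r(f_n^*, f^*) + \frac{c(f_n^*)\log 2 + 2\log(1/\delta)}{(1 - \alpha)\, n\epsilon}. \qquad (13.35)$$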

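Now set $\delta = e^{-t n \epsilon (1 - \alpha)/2}$; then we have

$$P\Big( r(\hat{f}_n, f^*) - \frac{1 + \alpha}{1 - \alpha}\, r(f_n^*, f^*) - \frac{c(f_n^*)\log 2}{(1 - \alpha)\, n\epsilon} \ge t \Big) \le 2 e^{-t n \epsilon (1 - \alpha)/2}. \qquad (13.36)$$

Integrating, we get

$$E\big[r(\hat{f}_n, f^*)\big] - \frac{1 + \alpha}{1 - \alpha}\, r(f_n^*, f^*) - \frac{c(f_n^*)\log 2}{(1 - \alpha)\, n\epsilon} \le \int_0^\infty P\Big( r(\hat{f}_n, f^*) - \frac{1 + \alpha}{1 - \alpha}\, r(f_n^*, f^*) - \frac{c(f_n^*)\log 2}{(1 - \alpha)\, n\epsilon} \ge t \Big)\, dt$$
$$\le \int_0^\infty 2 e^{-t n \epsilon (1 - \alpha)/2}\, dt \qquad (13.37)$$
$$= \frac{4}{(1 - \alpha)\, n\epsilon}.$$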
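To sum up, we have shown that for $\epsilon < \frac{6}{19 b^2}$,

$$E\big[r(\hat{f}_n, f^*)\big] \le \frac{1 + \alpha}{1 - \alpha}\, r(f_n^*, f^*) + \frac{c(f_n^*)\log 2 + 4}{(1 - \alpha)\, n\epsilon}, \qquad (13.38)$$

or,

$$E\big[r(\hat{f}_n, f^*)\big] \le \frac{1 + \alpha}{1 - \alpha} \min_{f \in \mathcal{F}} \Big\{ r(f, f^*) + \frac{c(f)\log 2}{n\epsilon} \Big\} + \frac{4}{(1 - \alpha)\, n\epsilon} \qquad (13.39)$$

since $0 < \alpha < 1$. Or, in expanded form:

$$E\big[R(\hat{f}_n)\big] - R(f^*) \le \Big(\frac{1 + \alpha}{1 - \alpha}\Big) \min_{f \in \mathcal{F}} \Big\{ R(f) - R(f^*) + \frac{c(f)\log 2}{n\epsilon} \Big\} + \frac{4}{(1 - \alpha)\, n\epsilon}. \qquad (13.40)$$

Notice that if $f^* \in \mathcal{F}$ and if $c(f^*)$ is not too large (e.g., $c(f^*) \approx \log n$), then we have $E\big[R(\hat{f}_n)\big] - R(f^*) = O(n^{-1} \log n)$, within a logarithmic factor of the parametric rate of convergence!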
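The selection rule analyzed above can be tried on a toy problem. The following is a minimal sketch, not the notes' own implementation: the candidate class (eleven constant functions), the uniform code lengths, the synthetic data and the value of `eps` are all assumptions made for illustration.

```python
import math
import random


def kraft_sum(complexities):
    """Left-hand side of the Kraft inequality: sum over f of 2^(-c(f))."""
    return sum(2.0 ** (-c) for c in complexities)


def penalized_erm(candidates, complexities, xs, ys, eps):
    """Index minimizing empirical squared-error risk + c(f) log 2 / (n eps)."""
    n = len(xs)
    best_index, best_objective = None, float("inf")
    for index, (f, c) in enumerate(zip(candidates, complexities)):
        empirical_risk = sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / n
        objective = empirical_risk + c * math.log(2) / (n * eps)
        if objective < best_objective:
            best_index, best_objective = index, objective
    return best_index, best_objective


# Hypothetical toy class: constant functions with values -0.5, -0.4, ..., 0.5.
levels = [j / 10 - 0.5 for j in range(11)]
candidates = [lambda x, v=v: v for v in levels]
# Uniform code: every candidate gets log2(|F|) bits, so the Kraft sum equals 1.
complexities = [math.log2(len(candidates))] * len(candidates)

# Synthetic data: f*(x) = 0.2 plus bounded noise, so Y stays inside [-1/2, 1/2].
rng = random.Random(0)
xs = [rng.random() for _ in range(500)]
ys = [0.2 + rng.uniform(-0.3, 0.3) for _ in xs]

eps = 0.3  # assumed small positive constant, chosen only for this illustration
chosen, objective = penalized_erm(candidates, complexities, xs, ys, eps)
print("Kraft sum:", kraft_sum(complexities))
print("selected constant:", levels[chosen])
```

With equal code lengths the penalty is the same for every candidate, so the rule reduces to plain empirical risk minimization; unequal code lengths would bias the choice toward functions with shorter descriptions.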



Chapter 14 

Maximum Likelihood Estimation 1 



In the last lecture (Chapter 13) we derived a risk (MSE) bound for regression problems; i.e., select an $f \in \mathcal{F}$ so that $E[(f(X) - Y)^2] - E[(f^*(X) - Y)^2]$ is small, where $f^*(x) = E[Y \mid X = x]$. The result is summarized below.

Theorem 14.1: Complexity Regularization with Squared Error Loss 

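Let $\mathcal{X} = R^d$, $\mathcal{Y} = [-b/2, b/2]$, $\{X_i, Y_i\}_{i=1}^n$ iid, $P_{XY}$ unknown, $\mathcal{F}$ = {collection of candidate functions},

$$f : R^d \to \mathcal{Y}, \qquad R(f) = E[(f(X) - Y)^2]. \qquad (14.1)$$

Let $c(f)$, $f \in \mathcal{F}$, be positive numbers satisfying $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$, and select a function from $\mathcal{F}$ according to

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \Big\{ \hat{R}_n(f) + \frac{1}{\epsilon} \frac{c(f)\log 2}{n} \Big\}, \qquad (14.2)$$

with $\epsilon < \frac{6}{19 b^2}$ and $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$. Then,

$$E\big[R(\hat{f}_n)\big] - R(f^*) \le \Big(\frac{1 + \alpha}{1 - \alpha}\Big) \min_{f \in \mathcal{F}} \Big\{ R(f) - R(f^*) + \frac{1}{\epsilon} \frac{c(f)\log 2}{n} \Big\} + O(n^{-1}) \qquad (14.3)$$

where $\alpha = \frac{5\epsilon b^2 / 2}{1 - 2b^2\epsilon/3}$.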

14.1 Maximum Likelihood Estimation 

The focus of this lecture is to consider another approach to learning based on maximum likelihood estimation. 
Consider the classical signal plus noise model: 

$$Y_i = f(i/n) + W_i, \quad i = 1, \dots, n \qquad (14.4)$$

where the $W_i$ are iid zero-mean noises. Furthermore, assume that $W_i \sim P(w)$ for some known density $P(w)$. Then

$$Y_i \sim P(y - f(i/n)) \equiv P_{f_i}(y) \qquad (14.5)$$



lr This content is available online at <http://cnx.Org/content/ml6276/l.2/>. 
since $Y_i - f(i/n) = W_i$.

A very common and useful loss function to consider is

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \big( -\log P_{f_i}(Y_i) \big). \qquad (14.6)$$

Minimizing $\hat{R}_n$ with respect to $f$ is equivalent to maximizing

$$\frac{1}{n} \sum_{i=1}^n \log P_{f_i}(Y_i) \qquad (14.7)$$

or

$$\prod_{i=1}^n P_{f_i}(Y_i). \qquad (14.8)$$



Thus, using the negative log-likelihood as a loss function leads to maximum likelihood estimation. If the $W_i$ are iid zero-mean Gaussian r.v.s then this is just the squared error loss we considered last time. If the $W_i$ are Laplacian distributed, e.g. $P(w) \propto e^{-|w|}$, then we obtain the absolute error, or $L_1$, loss function. We can also handle non-additive models such as the Poisson model

$$Y_i \sim P(y \mid f(i/n)) = e^{-f(i/n)}\, \frac{[f(i/n)]^y}{y!}. \qquad (14.9)$$

In this case

$$-\log P(Y_i \mid f(i/n)) = f(i/n) - Y_i \log(f(i/n)) + \text{constant} \qquad (14.10)$$

which is a very different loss function, but quite appropriate for many imaging problems. 
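To make the correspondence between noise models and loss functions concrete, here is a small sketch of the three negative log-likelihoods discussed above. The function names are hypothetical, the Gaussian case assumes a known standard deviation `sigma`, and the Laplacian case assumes the density $\frac{1}{2} e^{-|w|}$.

```python
import math


def gaussian_nll(y, mean, sigma=1.0):
    """-log density of N(mean, sigma^2): squared error plus a constant."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mean) ** 2 / (2 * sigma ** 2)


def laplace_nll(y, mean):
    """-log of the density 0.5 * exp(-|y - mean|): absolute error plus a constant."""
    return math.log(2.0) + abs(y - mean)


def poisson_nll(y, rate):
    """-log of exp(-rate) * rate^y / y!: rate - y log(rate) + log(y!)."""
    return rate - y * math.log(rate) + math.lgamma(y + 1)


# Differences in loss do not depend on the additive constants.
print(gaussian_nll(1.0, 0.0) - gaussian_nll(1.0, 1.0))  # squared error / (2 sigma^2)
print(laplace_nll(3.0, 1.0) - laplace_nll(3.0, 3.0))    # absolute error
print([round(poisson_nll(4, r), 3) for r in (3.0, 4.0, 5.0)])
```

In each case only the part of the loss that depends on the predicted value matters for estimation, which is why the additive constants can be dropped.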

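Before we investigate maximum likelihood estimation for model selection, let's review some of the basic concepts. Let $\Theta$ denote a parameter space (e.g., $\Theta = R$), and assume we have observations

$$Y_i \overset{iid}{\sim} P_{\theta^*}(y), \quad i = 1, \dots, n \qquad (14.11)$$

where $\theta^* \in \Theta$ is a parameter determining the density of the $\{Y_i\}$. The ML estimator of $\theta^*$ is

$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n P_\theta(Y_i)$$
$$= \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log P_\theta(Y_i) \qquad (14.12)$$
$$= \arg\min_{\theta \in \Theta} \sum_{i=1}^n -\log P_\theta(Y_i).$$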

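$\theta^*$ maximizes the expected log-likelihood. To see this, let's compare the expected log-likelihood of $\theta^*$ with any other $\theta \in \Theta$.

$$E[\log P_{\theta^*}(Y) - \log P_\theta(Y)] = E\Big[\log \frac{P_{\theta^*}(Y)}{P_\theta(Y)}\Big]$$
$$= \int \log \frac{P_{\theta^*}(y)}{P_\theta(y)}\, P_{\theta^*}(y)\, dy \qquad (14.13)$$
$$= K(P_\theta, P_{\theta^*}) \quad \text{(the KL divergence)}$$
$$\ge 0, \quad \text{with equality iff } P_{\theta^*} = P_\theta.$$

Why? By Jensen's inequality,

$$-E\Big[\log \frac{P_{\theta^*}(Y)}{P_\theta(Y)}\Big] = E\Big[\log \frac{P_\theta(Y)}{P_{\theta^*}(Y)}\Big] \le \log E\Big[\frac{P_\theta(Y)}{P_{\theta^*}(Y)}\Big] = \log \int P_\theta(y)\, dy = 0 \qquad (14.14)$$
$$\Rightarrow K(P_\theta, P_{\theta^*}) \ge 0.$$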

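On the other hand, since $\hat{\theta}_n$ maximizes the likelihood over $\theta \in \Theta$, we have

$$\sum_{i=1}^n \log \frac{P_{\theta^*}(Y_i)}{P_{\hat{\theta}_n}(Y_i)} = \sum_{i=1}^n \big( \log P_{\theta^*}(Y_i) - \log P_{\hat{\theta}_n}(Y_i) \big) \le 0. \qquad (14.15)$$

Therefore,

$$\frac{1}{n} \sum_{i=1}^n \log \frac{P_{\theta^*}(Y_i)}{P_{\hat{\theta}_n}(Y_i)} - K(P_{\hat{\theta}_n}, P_{\theta^*}) + K(P_{\hat{\theta}_n}, P_{\theta^*}) \le 0 \qquad (14.16)$$

or re-arranging

$$K(P_{\hat{\theta}_n}, P_{\theta^*}) \le \Big| \frac{1}{n} \sum_{i=1}^n \log \frac{P_{\theta^*}(Y_i)}{P_{\hat{\theta}_n}(Y_i)} - K(P_{\hat{\theta}_n}, P_{\theta^*}) \Big|. \qquad (14.17)$$

Notice that the quantity

$$\frac{1}{n} \sum_{i=1}^n \log \frac{P_{\theta^*}(Y_i)}{P_\theta(Y_i)} \qquad (14.18)$$

is an empirical average whose mean is $K(P_\theta, P_{\theta^*})$. By the law of large numbers, for each $\theta \in \Theta$,

$$\Big| \frac{1}{n} \sum_{i=1}^n \log \frac{P_{\theta^*}(Y_i)}{P_\theta(Y_i)} - K(P_\theta, P_{\theta^*}) \Big| \overset{a.s.}{\to} 0. \qquad (14.19)$$

If this also holds for the sequence $\{\hat{\theta}_n\}$, then we have

$$K(P_{\hat{\theta}_n}, P_{\theta^*}) \le \Big| \frac{1}{n} \sum_{i=1}^n \log \frac{P_{\theta^*}(Y_i)}{P_{\hat{\theta}_n}(Y_i)} - K(P_{\hat{\theta}_n}, P_{\theta^*}) \Big| \to 0 \quad \text{as } n \to \infty \qquad (14.20)$$

which implies that

$$P_{\hat{\theta}_n} \to P_{\theta^*} \qquad (14.21)$$

which often implies that

$$\hat{\theta}_n \to \theta^* \qquad (14.22)$$

in some appropriate sense (e.g., point-wise or in norm).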
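Example 14.1: Gaussian Distributions

$$P_{\theta^*}(y) = \frac{1}{\sqrt{\pi}}\, e^{-(y - \theta^*)^2} \qquad (14.23)$$

$$\Theta = R, \quad \{Y_i\}_{i=1}^n \overset{iid}{\sim} P_{\theta^*}(y) \qquad (14.24)$$

$$K(P_\theta, P_{\theta^*}) = \int \log \frac{P_{\theta^*}(y)}{P_\theta(y)}\, P_{\theta^*}(y)\, dy$$
$$= \int \big[ (y - \theta)^2 - (y - \theta^*)^2 \big] P_{\theta^*}(y)\, dy$$
$$= E_{\theta^*}\big[(Y - \theta)^2\big] - E_{\theta^*}\big[(Y - \theta^*)^2\big]$$
$$= E_{\theta^*}\big[Y^2 - 2Y\theta + \theta^2\big] - 1/2 \qquad (14.25)$$
$$= (\theta^*)^2 + 1/2 - 2\theta^*\theta + \theta^2 - 1/2$$
$$= (\theta^* - \theta)^2.$$

$$\theta^* \text{ maximizes } E[\log P_\theta(Y)] \text{ wrt } \theta \in \Theta \qquad (14.26)$$

$$\hat{\theta}_n = \arg\max_\theta \Big\{ -\sum_{i=1}^n (Y_i - \theta)^2 \Big\} = \arg\min_\theta \Big\{ \sum_{i=1}^n (Y_i - \theta)^2 \Big\} = \frac{1}{n} \sum_{i=1}^n Y_i. \qquad (14.27)$$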
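The two conclusions of the Gaussian example can be checked numerically. The sketch below is illustrative only: it assumes the variance-1/2 Gaussian density used in the example, approximates the KL integral with a midpoint rule on a truncated range, and draws a synthetic sample with a fixed seed.

```python
import math
import random


def density(y, theta):
    """Gaussian density with mean theta and variance 1/2."""
    return math.exp(-(y - theta) ** 2) / math.sqrt(math.pi)


def kl_numeric(theta, theta_star, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of K(P_theta, P_theta*)."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        y = lo + (j + 0.5) * h
        # log P_theta*(y) - log P_theta(y) for this density
        log_ratio = (y - theta) ** 2 - (y - theta_star) ** 2
        total += log_ratio * density(y, theta_star) * h
    return total


def sum_of_squares(ys, theta):
    """Negative log-likelihood of the sample, up to an additive constant."""
    return sum((y - theta) ** 2 for y in ys)


theta_star = 1.0
rng = random.Random(1)
ys = [rng.gauss(theta_star, math.sqrt(0.5)) for _ in range(200)]
theta_hat = sum(ys) / len(ys)  # the sample mean, i.e. the ML estimate

print("numeric KL:", kl_numeric(0.3, theta_star), "closed form:", (theta_star - 0.3) ** 2)
print("ML estimate:", theta_hat)
```

The numeric integral should agree with $(\theta^* - \theta)^2$ to several digits, and perturbing `theta_hat` in either direction should never decrease the sum of squares.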



14.1.1 Hellinger Distance 

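The KL divergence is not a distance function; in general,

$$K(P_{\theta_1}, P_{\theta_2}) \ne K(P_{\theta_2}, P_{\theta_1}). \qquad (14.28)$$

Therefore, it is often more convenient to work with the Hellinger metric,

$$H(P_{\theta_1}, P_{\theta_2}) = \Big( \int \big( P_{\theta_1}^{1/2}(y) - P_{\theta_2}^{1/2}(y) \big)^2 dy \Big)^{1/2}. \qquad (14.29)$$

The Hellinger metric is non-negative and symmetric,

$$H(P_{\theta_1}, P_{\theta_2}) = H(P_{\theta_2}, P_{\theta_1}), \qquad (14.30)$$

and it satisfies the triangle inequality (it is the $L_2$ distance between the square roots of the densities); therefore it is a distance measure. Furthermore, the squared Hellinger distance lower bounds the KL divergence, so convergence in KL divergence implies convergence of the Hellinger distance.
Proposition 1

$$H^2(P_{\theta_1}, P_{\theta_2}) \le K(P_{\theta_1}, P_{\theta_2}) \qquad (14.31)$$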
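Proof:

$$H^2(P_{\theta_1}, P_{\theta_2}) = \int \Big( \sqrt{P_{\theta_1}(y)} - \sqrt{P_{\theta_2}(y)} \Big)^2 dy$$
$$= \int P_{\theta_1}(y)\, dy + \int P_{\theta_2}(y)\, dy - 2 \int \sqrt{P_{\theta_1}(y)} \sqrt{P_{\theta_2}(y)}\, dy$$
$$= 2 - 2 \int \sqrt{P_{\theta_1}(y)} \sqrt{P_{\theta_2}(y)}\, dy, \quad \text{since } \int P_\theta(y)\, dy = 1\ \forall \theta$$
$$= 2 \Big( 1 - E_{\theta_2}\Big[ \sqrt{P_{\theta_1}(Y) / P_{\theta_2}(Y)} \Big] \Big) \qquad (14.32)$$
$$\le -2 \log E_{\theta_2}\Big[ \sqrt{P_{\theta_1}(Y) / P_{\theta_2}(Y)} \Big], \quad \text{since } 1 - x \le -\log x$$
$$\le 2 E_{\theta_2}\Big[ \log \sqrt{P_{\theta_2}(Y) / P_{\theta_1}(Y)} \Big], \quad \text{by Jensen's inequality}$$
$$= E_{\theta_2}\big[ \log ( P_{\theta_2}(Y) / P_{\theta_1}(Y) ) \big] = K(P_{\theta_1}, P_{\theta_2}).$$

Note that in the proof we also showed that

$$H^2(P_{\theta_1}, P_{\theta_2}) = 2 \Big( 1 - \int \sqrt{P_{\theta_1}(y)} \sqrt{P_{\theta_2}(y)}\, dy \Big) \qquad (14.33)$$

and using the fact $\log x \le x - 1$ again, we have

$$H^2(P_{\theta_1}, P_{\theta_2}) \le -2 \log \Big( \int \sqrt{P_{\theta_1}(y)} \sqrt{P_{\theta_2}(y)}\, dy \Big). \qquad (14.34)$$

The quantity inside the log is called the affinity between $P_{\theta_1}$ and $P_{\theta_2}$:

$$A(P_{\theta_1}, P_{\theta_2}) = \int \sqrt{P_{\theta_1}(y)} \sqrt{P_{\theta_2}(y)}\, dy. \qquad (14.35)$$

This is another measure of closeness between $P_{\theta_1}$ and $P_{\theta_2}$.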
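Example 14.2: Gaussian Distributions

$$P_\theta(y) = \frac{1}{\sqrt{\pi}}\, e^{-(y - \theta)^2} \qquad (14.36)$$

$$-2 \log \int \sqrt{P_{\theta_1}(y)} \sqrt{P_{\theta_2}(y)}\, dy = -2 \log \int \frac{1}{\sqrt{\pi}}\, e^{-\left[ \frac{(y - \theta_1)^2}{2} + \frac{(y - \theta_2)^2}{2} \right]} dy$$
$$= -2 \log \int \frac{1}{\sqrt{\pi}}\, e^{-\left[ \left( y - \frac{\theta_1 + \theta_2}{2} \right)^2 + \left( \frac{\theta_1 - \theta_2}{2} \right)^2 \right]} dy \qquad (14.37)$$
$$= -2 \log e^{-\left( \frac{\theta_1 - \theta_2}{2} \right)^2} = \frac{1}{2} (\theta_1 - \theta_2)^2.$$

$$-2 \log A(P_{\theta_1}, P_{\theta_2}) = \tfrac{1}{2} (\theta_1 - \theta_2)^2 \text{ for Gaussian distributions} \qquad (14.38)$$
$$\Rightarrow H^2(P_{\theta_1}, P_{\theta_2}) \le \tfrac{1}{2} (\theta_1 - \theta_2)^2 \text{ for Gaussian.}$$

Example 14.3: Poisson Distributions

If $P_\theta(y) = e^{-\theta}\, \frac{\theta^y}{y!}$, $\theta > 0$, then

$$-2 \log A(P_{\theta_1}, P_{\theta_2}) = \big( \sqrt{\theta_1} - \sqrt{\theta_2} \big)^2. \qquad (14.39)$$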
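The closed forms for the affinity in the Gaussian and Poisson examples can be checked numerically. The following is a rough sketch under stated assumptions: the Gaussian integral is approximated by a midpoint rule on a truncated range, the Poisson sum is truncated at `y_max`, and the function names are hypothetical.

```python
import math


def gaussian_affinity(theta1, theta2, lo=-15.0, hi=15.0, steps=30000):
    """Affinity of two Gaussians with variance 1/2, by midpoint-rule integration."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        y = lo + (j + 0.5) * h
        p1 = math.exp(-(y - theta1) ** 2) / math.sqrt(math.pi)
        p2 = math.exp(-(y - theta2) ** 2) / math.sqrt(math.pi)
        total += math.sqrt(p1 * p2) * h
    return total


def poisson_affinity(theta1, theta2, y_max=200):
    """Affinity of two Poisson pmfs: sum over y of sqrt(P1(y) * P2(y))."""
    total = 0.0
    for y in range(y_max + 1):
        # log of exp(-(theta1 + theta2) / 2) * (theta1 * theta2)^(y / 2) / y!
        log_term = (-(theta1 + theta2) / 2
                    + 0.5 * y * math.log(theta1 * theta2)
                    - math.lgamma(y + 1))
        total += math.exp(log_term)
    return total


a_gauss = gaussian_affinity(0.0, 2.0)
a_pois = poisson_affinity(4.0, 9.0)
print("-2 log A (Gaussian):", -2 * math.log(a_gauss), "expected", 0.5 * (0.0 - 2.0) ** 2)
print("-2 log A (Poisson): ", -2 * math.log(a_pois), "expected", (2.0 - 3.0) ** 2)
# The squared Hellinger distance 2 (1 - A) never exceeds -2 log A.
print("H^2 (Gaussian):", 2 * (1 - a_gauss))
```

The last line illustrates the inequality $H^2 \le -2 \log A$: the squared Hellinger distance is bounded by 2, while $-2 \log A$ grows without bound as the two distributions separate.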
Summary

$$Y_i \overset{iid}{\sim} P_{\theta^*} \qquad (14.40)$$

1. Maximum likelihood estimator maximizes the empirical average

$$\frac{1}{n} \sum_{i=1}^n \log P_\theta(Y_i) \qquad (14.41)$$

(our empirical risk is negative log-likelihood)
2. $\theta^*$ maximizes the expectation

$$E\Big[ \frac{1}{n} \sum_{i=1}^n \log P_\theta(Y_i) \Big] \qquad (14.42)$$

(the risk is the expected negative log-likelihood)
3.

$$\frac{1}{n} \sum_{i=1}^n \log P_\theta(Y_i) \overset{a.s.}{\to} E\Big[ \frac{1}{n} \sum_{i=1}^n \log P_\theta(Y_i) \Big] \qquad (14.43)$$

so we expect some sort of concentration of measure.
4. In particular, since

$$\frac{1}{n} \sum_{i=1}^n \log \frac{P_{\theta^*}(Y_i)}{P_\theta(Y_i)} \overset{a.s.}{\to} K(P_\theta, P_{\theta^*}) \qquad (14.44)$$

we might expect that $K(P_{\hat{\theta}_n}, P_{\theta^*}) \to 0$ for the sequence of estimates $\{P_{\hat{\theta}_n}\}_{n=1}^\infty$.

So, the point is that maximum likelihood estimation is just a special case of learning with a particular loss function, the negative log-likelihood. Due to its special structure, we are naturally led to consider KL divergences, Hellinger distances, and affinities.



Chapter 15 

Maximum Likelihood and Complexity 
Regularization 1 

15.1 Review : Maximum Likelihood Estimation 

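In the last lecture (Chapter 14), we considered $n$ i.i.d. observations drawn from an unknown distribution

$$Y_i \overset{iid}{\sim} p_{\theta^*}, \quad i = \{1, \dots, n\} \qquad (15.1)$$

where

$$\theta^* \in \Theta. \qquad (15.2)$$

With loss function defined as $l(\theta, Y_i) = -\log p_\theta(Y_i)$, the empirical risk is

$$\hat{R}_n = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i). \qquad (15.3)$$

Essentially, we want to choose a distribution from the collection of distributions within the parameter space that minimizes the empirical risk, i.e., we would like to select

$$p_{\hat{\theta}_n} \in \mathcal{P} = \{p_\theta\}_{\theta \in \Theta} \qquad (15.4)$$

where

$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i). \qquad (15.5)$$

The risk is defined as

$$R(\theta) = E[l(\theta, Y)] = -E[\log p_\theta(Y)]. \qquad (15.6)$$

Note that $\theta^*$ minimizes $R(\theta)$ over $\Theta$:

$$\theta^* = \arg\min_{\theta \in \Theta} -E[\log p_\theta(Y)] = \arg\min_{\theta \in \Theta} -\int \log p_\theta(y) \cdot p_{\theta^*}(y)\, dy. \qquad (15.7)$$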



1 This content is available online at <http://cnx.Org/content/ml6275/l.2/>. 
Finally, the excess risk of $\theta$ is defined as

$$R(\theta) - R(\theta^*) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)}\, p_{\theta^*}(y)\, dy \equiv K(p_\theta, p_{\theta^*}). \qquad (15.8)$$

We recognized that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) Divergence or Relative Entropy, denoted by $K(p_{\theta_1}, p_{\theta_2})$. It is easy to see that $K(p_{\theta_1}, p_{\theta_2})$ is always non-negative and is zero if and only if $p_{\theta_1} = p_{\theta_2}$. KL divergence measures how different two probability distributions are and is therefore a natural measure of convergence for maximum likelihood procedures. However, $K(p_{\theta_1}, p_{\theta_2})$ is not a distance metric because it is not symmetric and does not satisfy the triangle inequality. For this reason, two other quantities play a key role in maximum likelihood estimation, namely Hellinger Distance and Affinity.
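The Hellinger distance is defined as

$$H(p_{\theta_1}, p_{\theta_2}) = \Big( \int \Big( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \Big)^2 dy \Big)^{1/2}. \qquad (15.9)$$

We proved that the squared Hellinger distance lower bounds the KL divergence:

$$H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_1}, p_{\theta_2}), \qquad H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_2}, p_{\theta_1}). \qquad (15.10)$$

The affinity is defined as

$$A(p_{\theta_1}, p_{\theta_2}) = \int \sqrt{p_{\theta_1}(y) \cdot p_{\theta_2}(y)}\, dy; \qquad (15.11)$$

we also proved that

$$H^2(p_{\theta_1}, p_{\theta_2}) \le -2 \log ( A(p_{\theta_1}, p_{\theta_2}) ). \qquad (15.12)$$

Example 15.1: Gaussian Distribution

$Y$ is Gaussian with mean $\theta$ and variance $\sigma^2$.

$$p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \theta)^2}{2\sigma^2}}. \qquad (15.13)$$

First, look at

$$\log \frac{p_{\theta_2}(y)}{p_{\theta_1}(y)} = \frac{\theta_1^2 - \theta_2^2 - 2(\theta_1 - \theta_2)\, y}{2\sigma^2}.$$

Then,

$$K(p_{\theta_1}, p_{\theta_2}) = E_{\theta_2}\Big[ \log \frac{p_{\theta_2}(Y)}{p_{\theta_1}(Y)} \Big]$$
$$= \int \frac{\theta_1^2 - \theta_2^2 - 2(\theta_1 - \theta_2)\, y}{2\sigma^2}\, p_{\theta_2}(y)\, dy \qquad (15.14)$$
$$= \frac{\theta_1^2 + \theta_2^2 - 2\theta_1\theta_2}{2\sigma^2} = \frac{(\theta_1 - \theta_2)^2}{2\sigma^2},$$

$$-2 \log A(p_{\theta_1}, p_{\theta_2}) = -2 \log \Big( \int \Big( \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \theta_1)^2}{2\sigma^2}} \Big)^{1/2} \Big( \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \theta_2)^2}{2\sigma^2}} \Big)^{1/2} dy \Big) \qquad (15.15)$$
$$= -2 \log e^{-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}} = \frac{(\theta_1 - \theta_2)^2}{4\sigma^2}.$$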

15.2 Maximum likelihood estimation and Complexity regularization 

Suppose that we have $n$ i.i.d. training samples, $\{X_i, Y_i\}_{i=1}^n \overset{iid}{\sim} p_{XY}$.
Using conditional probability, $p_{XY}$ can be written as

$$p_{XY}(x, y) = p_X(x) \cdot p_{Y|X=x}(y). \qquad (15.16)$$

Let's assume for the moment that $p_X$ is completely unknown, but $p_{Y|X=x}(y)$ has a special form:

$$p_{Y|X=x}(y) = p_{f^*(x)}(y) \qquad (15.17)$$

where $p_{Y|X=x}(y)$ is a known parametric density function with parameter $f^*(x)$.
Example 15.2: Signal-plus-noise observation model

$$Y_i = f^*(X_i) + W_i, \quad i = 1, \dots, n \qquad (15.18)$$

where $W_i \overset{iid}{\sim} N(0, \sigma^2)$ and $X_i \overset{iid}{\sim} p_X$.

$$p_{f^*(x)}(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - f^*(x))^2}{2\sigma^2}} \qquad (15.19)$$

$Y \mid X = x \sim \text{Poisson}(f^*(x))$

$$p_{f^*(x)}(y) = e^{-f^*(x)}\, \frac{[f^*(x)]^y}{y!} \qquad (15.20)$$

The likelihood loss function is

$$l(f(X), Y) = -\log p_{XY}(X, Y)$$
$$= -\log p_X(X) - \log p_{Y|X}(Y \mid X) \qquad (15.21)$$
$$= -\log p_X(X) - \log p_{f(X)}(Y).$$
The expected loss is

$$E[l(f(X), Y)] = E_X\big[ E_{Y|X}[ l(f(X), Y) \mid X = x ] \big]$$
$$= E_X\big[ E_{Y|X}[ -\log p_X(x) - \log p_{f(x)}(Y) \mid X = x ] \big]$$
$$= -E_X[\log p_X(X)] - E_X\big[ E_{Y|X}[ \log p_{f(x)}(Y) \mid X = x ] \big] \qquad (15.22)$$
$$= -E_X[\log p_X(X)] - E[\log p_{f(X)}(Y)].$$

Notice that the first term is a constant with respect to $f$.
Hence, we define our risk to be

$$R(f) = -E[\log p_{f(X)}(Y)]$$
$$= -E_X\big[ E_{Y|X}[ \log p_{f(x)}(Y) \mid X = x ] \big] \qquad (15.23)$$
$$= -\int \Big( \int \log p_{f(x)}(y) \cdot p_{f^*(x)}(y)\, dy \Big) p_X(x)\, dx.$$

The function $f^*$ minimizes this risk since $f(x) = f^*(x)$ minimizes the integrand.
Our empirical risk is the negative log-likelihood of the training samples:

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n -\log p_{f(X_i)}(Y_i). \qquad (15.24)$$

The value $\frac{1}{n}$ is the empirical probability of observing $X = X_i$.

Often in function estimation, we have control over where we sample $X$. Let's assume that $\mathcal{X} = [0, 1]^d$ and $\mathcal{Y} = R$. Suppose we sample $\mathcal{X}$ uniformly with $n = m^d$ samples for some positive integer $m$ (i.e., take $m$ evenly spaced samples in each coordinate).

Let $x_i$, $i = 1, \dots, n$ denote these sample points, and assume that $Y_i \sim p_{f^*(x_i)}(y)$. Then, our empirical risk is

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n -\log p_{f(x_i)}(Y_i). \qquad (15.25)$$

Note that $x_i$ is now a deterministic quantity.
Our risk is

$$R(f) = -\frac{1}{n} \sum_{i=1}^n E\big[ \log p_{f(x_i)}(Y_i) \big]. \qquad (15.26)$$

The risk is minimized by $f^*$. However, $f^*$ is not a unique minimizer. Any $f$ that agrees with $f^*$ at the sample points $\{x_i\}$ also minimizes this risk.

Now, we will make use of the following vector and shorthand notation. The uppercase $Y$ denotes a random variable, while the lowercase $y$ and $x$ denote deterministic quantities.

$$Y = [Y_1, Y_2, \dots, Y_n]^T, \qquad y = [y_1, y_2, \dots, y_n]^T, \qquad x = [x_1, x_2, \dots, x_n]^T \qquad (15.27)$$

Then,

$$p_f(Y) = \prod_{i=1}^n p(Y_i \mid f(x_i)) \quad \text{(random)}$$
$$p_f(y) = \prod_{i=1}^n p(y_i \mid f(x_i)) \quad \text{(deterministic)}$$

With this notation, the empirical risk and the true risk can be written as

$$\hat{R}_n(f) = -\frac{1}{n} \log p_f(Y),$$
$$R(f) = -\frac{1}{n} E[\log p_f(Y)] \qquad (15.28)$$
$$= -\frac{1}{n} \int \log p_f(y) \cdot p_{f^*}(y)\, dy.$$
15.3 Error Bound 

Suppose that we have a pool of candidate functions $\mathcal{F}$, and we want to select a function $f$ from $\mathcal{F}$ using the training data. Our usual approach is to show that the distribution of $\hat{R}_n(f)$ concentrates about its mean as $n$ grows. First, we assign a complexity $c(f) > 0$ to each $f \in \mathcal{F}$ so that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. Then, apply the union bound to get a uniform concentration inequality holding for all models in $\mathcal{F}$. Finally, we use this concentration inequality to bound the expected risk of our selected model.

We will essentially accomplish the same result here, but avoid the need for explicit concentration inequalities and instead make use of the information-theoretic bounds.

We would like to select an $f \in \mathcal{F}$ so that the excess risk is small.

$$0 \le R(f) - R(f^*)$$
$$= \frac{1}{n} E[\log p_{f^*}(Y) - \log p_f(Y)]$$
$$= \frac{1}{n} E\Big[ \log \frac{p_{f^*}(Y)}{p_f(Y)} \Big] \qquad (15.29)$$
$$= \frac{1}{n} K(p_f, p_{f^*})$$

where

$$K(p_f, p_{f^*}) = \sum_{i=1}^n \int \log \frac{p_{f^*(x_i)}(y_i)}{p_{f(x_i)}(y_i)} \cdot p_{f^*(x_i)}(y_i)\, dy_i \qquad (15.30)$$

is again the KL divergence.

Unfortunately, as mentioned before, $K(p_f, p_{f^*})$ is not a true distance. So instead we will focus on the expected squared Hellinger distance as our measure of performance. We will get a bound on

$$\frac{1}{n} E\big[ H^2(p_{\hat{f}_n}(Y), p_{f^*}(Y)) \big] = \frac{1}{n} E\Big[ \sum_{i=1}^n \int \Big( \sqrt{p_{\hat{f}_n(x_i)}(y_i)} - \sqrt{p_{f^*(x_i)}(y_i)} \Big)^2 dy_i \Big]. \qquad (15.31)$$

15.4 Maximum Complexity- Regularized Likelihood Estimation 

Theorem 15.1: Li-Barron 2000, Kolaczyk-Nowak 2002 

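Let $\{x_i, Y_i\}_{i=1}^n$ be a random sample of training data with $\{Y_i\}$ independent,

$$Y_i \sim p_{f^*(x_i)}(y_i), \quad i = 1, \dots, n \qquad (15.32)$$

for some unknown function $f^*$.
Suppose we have a collection of candidate functions $\mathcal{F}$, and complexities $c(f) > 0$, $f \in \mathcal{F}$, satisfying

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1. \qquad (15.33)$$

Define the complexity-regularized estimator

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \Big\{ -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i) + \frac{2 c(f) \log 2}{n} \Big\}. \qquad (15.34)$$

Then,

$$\frac{1}{n} E\big[ H^2(p_{\hat{f}_n}(Y), p_{f^*}(Y)) \big] \le -\frac{2}{n} E\big[ \log ( A(p_{\hat{f}_n}(Y), p_{f^*}(Y)) ) \big]$$
$$\le \min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} K(p_f, p_{f^*}) + \frac{2 c(f) \log 2}{n} \Big\}. \qquad (15.35)$$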



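Before proving the theorem, let's look at a special case.
Example 15.3: Gaussian noise

Suppose $Y_i = f^*(x_i) + W_i$, $W_i \overset{iid}{\sim} N(0, \sigma^2)$.

$$p_{f(x_i)}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}} \qquad (15.36)$$

Using results from Example 15.1 (Gaussian Distribution), we have

$$-2 \log A(p_{\hat{f}_n}(Y), p_{f^*}(Y)) = \sum_{i=1}^n -2 \log \int \sqrt{p_{\hat{f}_n(x_i)}(y_i) \cdot p_{f^*(x_i)}(y_i)}\, dy_i$$
$$= \frac{1}{4\sigma^2} \sum_{i=1}^n \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2. \qquad (15.37)$$

Then,

$$-\frac{2}{n} E\big[ \log A(p_{\hat{f}_n}, p_{f^*}) \big] = \frac{1}{4\sigma^2 n} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2 \Big]. \qquad (15.38)$$

We also have,

$$\frac{1}{n} K(p_f, p_{f^*}) = \frac{1}{2\sigma^2 n} \sum_{i=1}^n ( f(x_i) - f^*(x_i) )^2,$$
$$-\log p_f(Y) = \sum_{i=1}^n \frac{(Y_i - f(x_i))^2}{2\sigma^2} + \text{constant}. \qquad (15.39)$$

Combine everything together to get

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - f(x_i))^2}{2\sigma^2} + \frac{2 c(f) \log 2}{n} \Big\}. \qquad (15.40)$$

The theorem tells us that

$$\frac{1}{4 n \sigma^2} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2 \Big] \le \min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n \frac{(f(x_i) - f^*(x_i))^2}{2\sigma^2} + \frac{2 c(f) \log 2}{n} \Big\} \qquad (15.41)$$

or

$$\frac{1}{n} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2 \Big] \le \min_{f \in \mathcal{F}} \Big\{ \frac{2}{n} \sum_{i=1}^n ( f(x_i) - f^*(x_i) )^2 + \frac{8\sigma^2 c(f) \log 2}{n} \Big\}. \qquad (15.42)$$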



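Now let's come back to the proof.
Proof 15.1:

$$H^2(p_{\hat{f}_n}, p_{f^*}) = \int \Big( \sqrt{p_{\hat{f}_n}(y)} - \sqrt{p_{f^*}(y)} \Big)^2 dy \qquad (15.43)$$
$$\le -2 \log \underbrace{\int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy}_{\text{affinity}} \qquad (15.44)$$
$$\Rightarrow E\big[ H^2(p_{\hat{f}_n}, p_{f^*}) \big] \le 2 E\Big[ \log \Big( \frac{1}{\int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big]. \qquad (15.45)$$

Now, define the theoretical analog of $\hat{f}_n$:

$$f_n = \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} K(p_f, p_{f^*}) + \frac{2 c(f) \log 2}{n} \Big\}. \qquad (15.46)$$

Since

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \Big\{ -\frac{1}{n} \log p_f(Y) + \frac{2 c(f) \log 2}{n} \Big\}$$
$$= \arg\max_{f \in \mathcal{F}} \Big\{ \frac{1}{n} ( \log p_f(Y) - 2 c(f) \log 2 ) \Big\}$$
$$= \arg\max_{f \in \mathcal{F}} \Big\{ \frac{1}{2} ( \log p_f(Y) - 2 c(f) \log 2 ) \Big\} \qquad (15.47)$$
$$= \arg\max_{f \in \mathcal{F}} \Big\{ \log \Big( \sqrt{p_f(Y)} \cdot e^{-c(f) \log 2} \Big) \Big\}$$
$$= \arg\max_{f \in \mathcal{F}} \Big\{ \sqrt{p_f(Y)} \cdot e^{-c(f) \log 2} \Big\},$$

we can see that

$$\frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f_n}(Y)}\, e^{-c(f_n) \log 2}} \ge 1. \qquad (15.48)$$

Then we can write

$$E\big[ H^2(p_{\hat{f}_n}, p_{f^*}) \big] \le 2 E\Big[ \log \Big( \frac{1}{\int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big]$$
$$\le 2 E\Big[ \log \Big( \frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f_n}(Y)}\, e^{-c(f_n) \log 2} \int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big]. \qquad (15.49)$$

Now, simply multiply the argument inside the log by $\sqrt{p_{f^*}(Y)} / \sqrt{p_{f^*}(Y)}$ to get

$$E\big[ H^2(p_{\hat{f}_n}, p_{f^*}) \big] \le 2 E\Big[ \log \Big( \frac{\sqrt{p_{f^*}(Y)}}{\sqrt{p_{f_n}(Y)}\, e^{-c(f_n) \log 2}} \cdot \frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f^*}(Y)} \int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big]$$
$$= E\Big[ \log \frac{p_{f^*}(Y)}{p_{f_n}(Y)} \Big] + 2 c(f_n) \log 2 + 2 E\Big[ \log \Big( \frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f^*}(Y)} \int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big] \qquad (15.50)$$
$$= K(p_{f_n}, p_{f^*}) + 2 c(f_n) \log 2 + 2 E\Big[ \log \Big( \frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f^*}(Y)} \int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big].$$

The terms $K(p_{f_n}, p_{f^*}) + 2 c(f_n) \log 2$ are precisely what we wanted for the upper bound of the theorem. So, to finish the proof we only need to show that the last term is non-positive. Applying Jensen's inequality, we get

$$2 E\Big[ \log \Big( \frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f^*}(Y)} \int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big] \le 2 \log E\Big[ e^{-c(\hat{f}_n) \log 2}\, \frac{\sqrt{p_{\hat{f}_n}(Y) / p_{f^*}(Y)}}{\int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big]. \qquad (15.51)$$

Both $Y$ and $\hat{f}_n$ are random, which makes the expectation difficult to compute. However, we can simplify the problem using the union bound, which eliminates the dependence on $\hat{f}_n$:

$$2 E\Big[ \log \Big( \frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f^*}(Y)} \int \sqrt{p_{\hat{f}_n}(y) \cdot p_{f^*}(y)}\, dy} \Big) \Big] \le 2 \log E\Big[ \sum_{f \in \mathcal{F}} e^{-c(f) \log 2}\, \frac{\sqrt{p_f(Y) / p_{f^*}(Y)}}{\int \sqrt{p_f(y) \cdot p_{f^*}(y)}\, dy} \Big]$$
$$= 2 \log \Big( \sum_{f \in \mathcal{F}} 2^{-c(f)}\, \frac{E\big[ \sqrt{p_f(Y) / p_{f^*}(Y)} \big]}{\int \sqrt{p_f(y) \cdot p_{f^*}(y)}\, dy} \Big) \qquad (15.52)$$
$$= 2 \log \Big( \sum_{f \in \mathcal{F}} 2^{-c(f)} \Big)$$
$$\le 0,$$

where the last two lines come from

$$E\Big[ \sqrt{\frac{p_f(Y)}{p_{f^*}(Y)}} \Big] = \int \sqrt{\frac{p_f(y)}{p_{f^*}(y)}}\, p_{f^*}(y)\, dy = \int \sqrt{p_f(y) \cdot p_{f^*}(y)}\, dy \qquad (15.53)$$

and

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1. \qquad (15.54)$$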
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Chapter 16 

Denoising II: Adapting to Unknown 
Smoothness 1 



16.1 Review: Denoising in Smooth Function Spaces I - Method of 
Sieves 

Suppose we make noisy measurements of a smooth function:

$$Y_i = f^*(x_i) + W_i, \quad i = \{1, \dots, n\}, \qquad (16.1)$$

where

$$W_i \overset{iid}{\sim} N(0, \sigma^2) \qquad (16.2)$$

and

$$x_i = \frac{i}{n}. \qquad (16.3)$$

The unknown function $f^*$ is a map

$$f^* : [0, 1] \to R. \qquad (16.4)$$

In Lecture 4 (Chapter 5), we considered this problem in the case where $f^*$ was Lipschitz on $[0, 1]$. That is, $f^*$ satisfied

$$|f^*(t) - f^*(s)| \le L |t - s|, \quad \forall t, s \in [0, 1] \qquad (16.5)$$

where $L > 0$ is a constant. In that case, we showed that by using a piecewise constant function on a partition of $n^{1/3}$ equal-size bins (Figure 16.1) we were able to obtain an estimator $\hat{f}_n$ whose mean square error was

$$E\big[ \| f^* - \hat{f}_n \|^2 \big] = O\big( n^{-2/3} \big). \qquad (16.6)$$



lr This content is available online at <http://cnx.Org/content/ml6268/l.2/>. 
Figure 16.1: Example of the piecewise constant approximation of $f^*$



In this lecture we will use the Maximum Complexity-Regularized Likelihood Estimation result we derived 
in Lecture 14 (Chapter 15) to extend our denoising scheme in several important ways. 
To begin with let's consider a broader class of functions. 



16.2 Holder Spaces 

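For $0 < \alpha < 1$, define the space of functions

$$H^\alpha(C_\alpha) = \Big\{ f : \sup_x |f(x)| \le C_\alpha,\ \sup_{x, h} \frac{|f(x + h) - f(x)|}{|h|^\alpha} \le C_\alpha \Big\} \qquad (16.7)$$

for some constant $C_\alpha < \infty$ and where $f \in L_\infty$. $H^\alpha$ above contains functions that are bounded, but less smooth than Lipschitz functions. Indeed, the space of Lipschitz functions can be defined as $H^1$ ($\alpha = 1$)

$$H^1(C_1) = \Big\{ f : \sup_x |f(x)| \le C_1,\ \sup_{x, h} \frac{|f(x + h) - f(x)|}{|h|} \le C_1 \Big\} \qquad (16.8)$$

for $C_1 < \infty$. Functions in $H^1$ are Lipschitz continuous, while those in $H^\alpha$, $\alpha < 1$, are continuous but not in general Lipschitz.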

Let's also consider functions that are smoother than Lipschitz. If $\alpha = 1 + \beta$, where $0 < \beta < 1$, then define

$$H^\alpha(C_\alpha) = \Big\{ f \in H^1(C_\alpha) : \frac{\partial f}{\partial x} \in H^\beta(C_\alpha) \Big\}. \qquad (16.9)$$

In other words, $H^\alpha$, $1 < \alpha < 2$, contains Lipschitz functions that are also differentiable and whose derivatives are Hölder smooth with smoothness $\beta = \alpha - 1$.
And finally, let

$$H^2(C_2) = \Big\{ f : \frac{\partial f}{\partial x} \in H^1(C_2) \Big\} \qquad (16.10)$$

contain functions that have continuous derivatives, but that are not necessarily twice-differentiable.

If $f \in H^\alpha(C_\alpha)$, $0 < \alpha \le 2$, then we say that $f$ is Hölder-$\alpha$ smooth with Hölder constant $C_\alpha$. The notion of Hölder smoothness can also be extended to $\alpha > 2$ in a straightforward way.
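The role of the exponent in the Hölder condition can be seen on a simple test function. The sketch below is illustrative only: the cusp function $f(x) = \sqrt{|x - 1/2|}$ and the finite grid of points are assumptions chosen for the demonstration. The difference quotient stays bounded for exponent 1/2 but blows up for exponent 1, so this function is Hölder-1/2 smooth without being Lipschitz.

```python
import math


def cusp(x):
    """Hypothetical test function with a square-root cusp at x = 1/2."""
    return math.sqrt(abs(x - 0.5))


def holder_quotient(f, alpha, h, grid=1000):
    """Largest value of |f(x + h) - f(x)| / h^alpha over x = 0, 1/grid, ..., 1."""
    return max(abs(f(j / grid + h) - f(j / grid)) / h ** alpha
               for j in range(grid + 1))


for h in (1e-2, 1e-4, 1e-6):
    print(h,
          round(holder_quotient(cusp, 0.5, h), 6),   # stays at about 1
          round(holder_quotient(cusp, 1.0, h), 1))   # grows like 1 / sqrt(h)
```

The grid includes the point $x = 1/2$, where the quotient is largest; a finite grid can only give a lower bound on the supremum in the definition.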
Note: If $\alpha_1 < \alpha_2$ then

$$f \in H^{\alpha_2} \Rightarrow f \in H^{\alpha_1}. \qquad (16.11)$$

Summarizing, we can describe Hölder spaces as follows. If $f^* \in H^\alpha(C_\alpha)$ for some $0 < \alpha \le 2$ and $C_\alpha < \infty$, then

(i): $0 < \alpha \le 1$: $\quad |f^*(t) - f^*(s)| \le C_\alpha |t - s|^\alpha$

(ii): $1 < \alpha \le 2$: $\quad \big| \frac{\partial f^*}{\partial x}(t) - \frac{\partial f^*}{\partial x}(s) \big| \le C_\alpha |t - s|^{\alpha - 1}$

Note that in general there is a natural relationship between the Hölder space containing the function and the approximation class used to estimate the function. Here we will consider functions which are Hölder-$\alpha$ smooth where $0 < \alpha \le 2$ and work with piecewise linear approximations. If we were to consider smoother functions, $\alpha > 2$, we would need to consider higher order approximation functions, i.e. quadratic, cubic, etc.

16.3 Denoising Example for Signal-plus-Gaussian Noise Observation 
Model 

Now let's assume $f^* \in H^\alpha(C_\alpha)$ for some unknown $\alpha$ ($0 < \alpha \le 2$); i.e. we don't know how smooth $f^*$ is. We will use our observations

$$Y_i = f^*(x_i) + W_i, \quad i = \{1, \dots, n\}, \qquad (16.12)$$

to construct an estimator $\hat{f}_n$. Intuitively, the smoother $f^*$ is, the better we should be able to estimate it. Can we take advantage of extra smoothness in $f^*$ if we don't know how smooth it is? The smoother $f^*$ is, the more averaging we can perform to reduce noise. In other words, for smoother $f^*$ we should average over larger bins. Also, we will need to exploit the extra smoothness in our approximation of $f^*$. To that end, we will consider candidate functions that are piecewise linear functions on uniform partitions of $[0, 1]$. Let

$$\mathcal{F}_k = \Big\{ f : |f| \le C,\ f \text{ is piecewise linear on } [0, \tfrac{1}{k}), [\tfrac{1}{k}, \tfrac{2}{k}), \dots, [\tfrac{k-1}{k}, 1] \text{ and the coefficients of each line segment are quantized to } \tfrac{1}{2} \log n \text{ bits} \Big\}. \qquad (16.13)$$

Figure 16.2: Example of the quantization of $f$ on interval $[\frac{i-1}{k}, \frac{i}{k})$ (vertical axis: quantization levels between $-C$ and $C$)



The start and end points of each line segment are each one of $\sqrt{n}$ discrete values, as indicated in Figure 16.2. Since each line may start at any of the $\sqrt{n}$ levels and terminate at any of the $\sqrt{n}$ levels, there are a total of $n$ possible lines for each segment.

Given that there are $k$ intervals we have

$$|\mathcal{F}_k| = n^k \Rightarrow \log |\mathcal{F}_k| = k \log n. \qquad (16.14)$$

Therefore we can use $k \log n$ bits to describe a function $f \in \mathcal{F}_k$.
Let

$$\mathcal{F} = \bigcup_{k \ge 1} \mathcal{F}_k. \qquad (16.15)$$

Construct a prefix code for every $f \in \mathcal{F}$ by

(i) Use $\underbrace{000 \cdots 1}_{k \text{ bits}}$ to encode the smallest $k$ such that $f \in \mathcal{F}_k$ $\qquad (16.16)$

(ii) Use $k \log n$ bits to encode which element of $\mathcal{F}_k$ we are considering.

Thus, if $f \in \mathcal{F}_k$, then the prefix code associated with $f$ has codeword length

$$c(f) = k + k \log n = k (1 + \log n) \qquad (16.17)$$

which satisfies the Kraft Inequality

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1. \qquad (16.18)$$
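The two-part code described above can be written out explicitly for small cases. This is a minimal sketch with assumptions that go beyond the text: `n` is taken to be a power of two so that $\log_2 n$ is an integer, each element of $\mathcal{F}_k$ is identified by a hypothetical integer `index` between 0 and $n^k - 1$, and the function names are invented for illustration.

```python
import math
from itertools import combinations


def codeword(k, index, n):
    """Unary code for k (k - 1 zeros, then a one) followed by k * log2(n) index bits."""
    index_bits = k * int(math.log2(n))
    return "0" * (k - 1) + "1" + format(index, "0{}b".format(index_bits))


def kraft_sum(n, max_k):
    """Sum over k <= max_k of |F_k| * 2^(-k (1 + log2 n)), with |F_k| = n^k."""
    bits = int(math.log2(n))
    return sum((n ** k) * 2.0 ** (-k * (1 + bits)) for k in range(1, max_k + 1))


# Enumerate every codeword for n = 4 and k in {1, 2}, then check prefix-freeness.
words = [codeword(k, i, 4) for k in (1, 2) for i in range(4 ** k)]
prefix_free = not any(a.startswith(b) or b.startswith(a)
                      for a, b in combinations(words, 2))

print("example codeword:", codeword(2, 5, 4))
print("prefix free:", prefix_free)
print("Kraft sum for n = 1024, k up to 50:", kraft_sum(1024, 50))
```

Each level $k$ contributes $n^k \cdot 2^{-k(1 + \log_2 n)} = 2^{-k}$ to the Kraft sum, so the total is a geometric series bounded by 1 no matter how many levels are included.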



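Now we will apply our complexity regularization result to select a function $\hat{f}_n$ from $\mathcal{F}$ and bound its risk. We are assuming Gaussian errors, so

$$-\log p_f(Y_i) = \frac{(Y_i - f(i/n))^2}{2\sigma^2} + \text{constant}. \qquad (16.19)$$

We can ignore the constant term and so our empirical selection is

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - f(i/n))^2}{2\sigma^2} + \frac{2 c(f) \log 2}{n} \Big\}. \qquad (16.20)$$

We can compute $\hat{f}_n$ according to:
For $k = 1, \dots, n$

$$\hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f) = \arg\min_{f \in \mathcal{F}_k} \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - f(i/n))^2}{2\sigma^2} \qquad (16.21)$$

then select

$$\hat{k} = \arg\min_{k \ge 1} \Big\{ \hat{R}_n\big(\hat{f}_n^{(k)}\big) + \frac{2 k (1 + \log n) \log 2}{n} \Big\} \qquad (16.22)$$

and finally

$$\hat{f}_n = \hat{f}_n^{(\hat{k})}. \qquad (16.23)$$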
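The three-step procedure above can be sketched in a few lines. This is an illustrative approximation rather than the exact estimator: the coefficients are not quantized (an unconstrained least-squares line is fitted on each bin), $\sigma$ is assumed known, the "log n" in the code length is read as $\log_2 n$, and the test signal and function names are assumptions.

```python
import math
import random


def fit_piecewise_linear(ys, k):
    """Least-squares line on each of k equal-width bins; design points x_i = i / n."""
    n = len(ys)
    bins = [[] for _ in range(k)]
    for i in range(n):
        x = (i + 1) / n
        bins[min(k - 1, int(x * k))].append((x, ys[i], i))
    fitted = [0.0] * n
    for points in bins:
        m = len(points)
        if m == 0:
            continue  # empty bin: nothing to fit
        x_bar = sum(p[0] for p in points) / m
        y_bar = sum(p[1] for p in points) / m
        sxx = sum((p[0] - x_bar) ** 2 for p in points)
        sxy = sum((p[0] - x_bar) * (p[1] - y_bar) for p in points)
        slope = sxy / sxx if sxx > 0 else 0.0
        for x, _, i in points:
            fitted[i] = y_bar + slope * (x - x_bar)
    return fitted


def select_fit(ys, sigma, max_k=None):
    """Pick k minimizing the scaled empirical risk plus 2 k (1 + log2 n) log 2 / n."""
    n = len(ys)
    best = None
    for k in range(1, (max_k or n) + 1):
        fitted = fit_piecewise_linear(ys, k)
        empirical_risk = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (2 * sigma ** 2 * n)
        penalty = 2 * k * (1 + math.log2(n)) * math.log(2) / n
        if best is None or empirical_risk + penalty < best[0]:
            best = (empirical_risk + penalty, k, fitted)
    return best[1], best[2]


# Synthetic example: a smooth signal observed in Gaussian noise.
rng = random.Random(0)
n, sigma = 256, 0.1
truth = [math.sin(2 * math.pi * (i + 1) / n) for i in range(n)]
ys = [t + rng.gauss(0.0, sigma) for t in truth]
k_hat, fitted = select_fit(ys, sigma)
mse = sum((f - t) ** 2 for f, t in zip(fitted, truth)) / n
print("selected number of bins:", k_hat)
print("MSE of the fit:", mse, "noise variance:", sigma ** 2)
```

Skipping the quantization step changes the candidate class slightly, but it keeps the essential behavior: the penalty grows linearly in the number of bins, so the selected partition balances approximation error against the cost of describing more line segments.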

Because the KL divergence and $-2 \log$ affinity simply reduce to squared error in the Gaussian case (Lecture 14 (Chapter 15)), we arrive at a relatively simple bound on the mean square error of $\hat{f}_n$:

$$\frac{1}{n} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(i/n) - f^*(i/n) \big)^2 \Big] \le \min_{f \in \mathcal{F}} \Big\{ \frac{2}{n} \sum_{i=1}^n \big( f(i/n) - f^*(i/n) \big)^2 + \frac{8\sigma^2 c(f) \log 2}{n} \Big\}. \qquad (16.24)$$

The first term in the brackets above is related to the error incurred by approximating $f^*$ by an element of $\mathcal{F}$. The second term is related to the estimation error involved with the model selection process.

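Let's focus on the approximation error. First, suppose $f^* \in H^\alpha(C_\alpha)$ for $1 < \alpha \le 2$. Let $f_k^*$ be the "best" piecewise linear approximation to $f^*$, with $k$ pieces on intervals $[0, \frac{1}{k}), [\frac{1}{k}, \frac{2}{k}), \dots, [\frac{k-1}{k}, 1)$. Consider the difference between $f^*$ and $f_k^*$ on one such interval, say $[\frac{i-1}{k}, \frac{i}{k})$. By applying Taylor's theorem with remainder we have

$$f^*(t) = f^*\Big(\frac{i}{k}\Big) + \frac{\partial f^*}{\partial x}(t') \Big( t - \frac{i}{k} \Big) \qquad (16.25)$$

for $t \in [\frac{i-1}{k}, \frac{i}{k})$ and some $t' \in [t, \frac{i}{k}]$. Define

$$f_k^*(t) \equiv f^*\Big(\frac{i}{k}\Big) + \frac{\partial f^*}{\partial x}\Big(\frac{i}{k}\Big) \Big( t - \frac{i}{k} \Big). \qquad (16.26)$$

Note that $f_k^*(t)$ is not necessarily the best piecewise linear approximation to $f^*$, just good enough for our purposes. Then using the fact that $f^* \in H^\alpha(C_\alpha)$, for $t \in [\frac{i-1}{k}, \frac{i}{k})$ we have

$$|f^*(t) - f_k^*(t)| = \Big| \frac{\partial f^*}{\partial x}(t') \Big( t - \frac{i}{k} \Big) - \frac{\partial f^*}{\partial x}\Big(\frac{i}{k}\Big) \Big( t - \frac{i}{k} \Big) \Big|$$
$$\le \frac{1}{k} \Big| \frac{\partial f^*}{\partial x}(t') - \frac{\partial f^*}{\partial x}\Big(\frac{i}{k}\Big) \Big| \qquad (16.27)$$
$$\le \frac{1}{k}\, C_\alpha \Big| t' - \frac{i}{k} \Big|^{\alpha - 1}$$
$$\le \frac{1}{k}\, C_\alpha \Big( \frac{1}{k} \Big)^{\alpha - 1} = C_\alpha k^{-\alpha}.$$

So, for all $t \in [0, 1]$

$$|f^*(t) - f_k^*(t)| \le C_\alpha k^{-\alpha}. \qquad (16.28)$$

Now let $f_k$ be the element of $\mathcal{F}_k$ closest to $f_k^*$ ($f_k$ is the quantized version of $f_k^*$).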



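$$|f^*(t) - f_k(t)| = |f^*(t) - f_k^*(t) + f_k^*(t) - f_k(t)|$$
$$\le |f^*(t) - f_k^*(t)| + |f_k^*(t) - f_k(t)| \qquad (16.29)$$
$$\le C_\alpha k^{-\alpha} + \frac{1}{\sqrt{n}}$$

since we used $\frac{1}{2} \log n$ bits to quantize the endpoints of each line segment. Consequently,

$$|f^*(t) - f_k(t)|^2 \le |f^*(t) - f_k^*(t)|^2 + 2 |f^*(t) - f_k^*(t)|\, |f_k^*(t) - f_k(t)| + |f_k^*(t) - f_k(t)|^2$$
$$\le C_\alpha^2 k^{-2\alpha} + 2 C_\alpha \frac{k^{-\alpha}}{\sqrt{n}} + \frac{1}{n}. \qquad (16.30)$$

Thus it follows that

$$\min_{f \in \mathcal{F}_k} \Big\{ \frac{2}{n} \sum_{i=1}^n \big( f^*(i/n) - f(i/n) \big)^2 + \frac{8\sigma^2 c(f) \log 2}{n} \Big\} \le 2 C_\alpha^2 k^{-2\alpha} + \frac{4 C_\alpha k^{-\alpha}}{\sqrt{n}} + \frac{2}{n} + \frac{8\sigma^2 k (\log n + 1) \log 2}{n}. \qquad (16.31)$$

The first and last terms dominate the above expression. Therefore, the upper bound is minimized when $k^{-2\alpha}$ and $\frac{k}{n}$ are balanced. This is accomplished by choosing $k = \lfloor n^{\frac{1}{2\alpha + 1}} \rfloor$. Then it follows that

$$\min_{f \in \mathcal{F}_k} \Big\{ \frac{2}{n} \sum_{i=1}^n \big( f^*(i/n) - f(i/n) \big)^2 + \frac{8\sigma^2 c(f) \log 2}{n} \Big\} = O\big( n^{-\frac{2\alpha}{2\alpha + 1}} \log n \big). \qquad (16.32)$$

If $\alpha = 2$ then we have

$$\frac{1}{n} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(i/n) - f^*(i/n) \big)^2 \Big] = O\big( n^{-4/5} \log n \big). \qquad (16.33)$$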
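The balancing argument can be made concrete with a few numbers. The sketch below is only an illustration: the constants `c_alpha` and `sigma` are set to 1 by assumption, the logarithm in the code length is taken base 2, and the function names are hypothetical.

```python
import math


def balanced_bins(n, alpha):
    """Number of bins k = floor(n^(1 / (2 alpha + 1))) that balances the two terms."""
    return max(1, math.floor(n ** (1.0 / (2 * alpha + 1))))


def bound_terms(n, k, alpha, c_alpha=1.0, sigma=1.0):
    """Dominant approximation and estimation terms of the upper bound."""
    approximation = 2 * c_alpha ** 2 * k ** (-2 * alpha)
    estimation = 8 * sigma ** 2 * k * (math.log2(n) + 1) * math.log(2) / n
    return approximation, estimation


n = 10000
for alpha in (1.0, 2.0):
    k = balanced_bins(n, alpha)
    rate = n ** (-2 * alpha / (2 * alpha + 1)) * math.log(n)
    print("alpha =", alpha, "k =", k, "terms =", bound_terms(n, k, alpha), "rate =", rate)
```

Smoother functions (larger $\alpha$) call for fewer, wider bins: the approximation term shrinks quickly in $k$, so less of the estimation budget has to be spent on describing line segments.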



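If $f^* \in H^\alpha(C_\alpha)$ for $0 < \alpha \le 1$, let $f_k^*$ be the following piecewise constant approximation to $f^*$. Let

$$f_k^*(t) \equiv f^*\Big(\frac{i}{k}\Big) \text{ on interval } \Big[ \frac{i-1}{k}, \frac{i}{k} \Big). \qquad (16.34)$$

Then

$$|f^*(t) - f_k^*(t)| = \Big| f^*(t) - f^*\Big(\frac{i}{k}\Big) \Big| \le C_\alpha \Big| t - \frac{i}{k} \Big|^\alpha \le C_\alpha k^{-\alpha}. \qquad (16.35)$$

Repeating the same reasoning as in the $1 < \alpha \le 2$ case, we arrive at

$$\frac{1}{n} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(i/n) - f^*(i/n) \big)^2 \Big] = O\big( n^{-\frac{2\alpha}{2\alpha + 1}} \log n \big) \qquad (16.36)$$

for $0 < \alpha \le 1$. In particular, for $\alpha = 1$ we get

$$\frac{1}{n} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(i/n) - f^*(i/n) \big)^2 \Big] = O\big( n^{-2/3} \log n \big) \qquad (16.37)$$

within a logarithmic factor of the rate we had before (in Lecture 4 (Chapter 5)) for that case!



16.4 Summary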



1. $\hat{f}_n$ can be computed by finding least-square line fits to the data on partitions of the form $[0, \frac{1}{k}), [\frac{1}{k}, \frac{2}{k}), \dots, [\frac{k-1}{k}, 1)$ for $k = 1, \dots, n$, and then selecting the best fit by the $k$ that gives the minimum of the complexity regularization criterion.

2. If $f^* \in H^\alpha(C_\alpha)$ for some $0 < \alpha \le 2$, then

$$\text{MSE}(\hat{f}_n) = \frac{1}{n} \sum_{i=1}^n E\Big[ \big( \hat{f}_n(i/n) - f^*(i/n) \big)^2 \Big] = O\big( n^{-\frac{2\alpha}{2\alpha + 1}} \log n \big). \qquad (16.38)$$

3. $\hat{f}_n$ automatically picks the optimal number of bins. Essentially $\hat{f}_n$ (indirectly) estimates the smoothness of $f^*$ and produces a rate which is near minimax optimal! ($n^{-\frac{2\alpha}{2\alpha + 1}}$ is the best possible.)

4. The larger $\alpha$ is, the faster the convergence and the better the denoising!






Chapter 17 

Nonlinear Approximation and Wavelet 
Analysis 1 

17.1 Review 

In Lecture 4 (Chapter 5) and 15 (Chapter 16), we investigated the problem of denoising a smooth signal in additive white noise. In Lecture 4 (Chapter 5), we considered Lipschitz functions and showed that by fitting constants on a uniform partition of width $n^{-1/3}$ we can achieve an $n^{-2/3}$ rate of MSE convergence.

In Lecture 15 (Chapter 16), we considered Hölder-$\alpha$ smooth functions, and we demonstrated that by automatically selecting the partition width and using polynomial fits we can obtain an MSE convergence rate of $n^{-2\alpha/(2\alpha+1)}$ (up to a logarithmic factor), substantially better when $\alpha > 1$. Also important is the fact that we don't need to know the value of $\alpha$ a priori. The estimator $\hat{f}_n$ is fundamentally different from its counterpart in Lecture 4 (Chapter 5).

In both cases $\hat{f}_n(t)$ is a linear function (polynomial or constant fit) of the data in each interval of the underlying partition. In Lecture 4 (Chapter 5), the partition was independent of the data, and so the overall estimator is a linear function of the data.

However, in Lecture 15 (Chapter 16) the partition itself was selected based on the data. Consequently, $\hat{f}_n(t)$ is a non-linear function of the data. Linear estimators (linear functions of the data) cannot adapt to unknown degrees of smoothness. In this lecture, we lay the groundwork for one more important extension in the denoising application: spatial adaptivity. That is, we would like to construct estimators that not only adapt to unknown degrees of global smoothness, but that also adapt to spatially varying degrees of smoothness.

We will focus on the approximation theoretic aspects of the problem in this lecture, considering tree-based approximations and wavelet expansions. In the next lecture (Chapter 21), we will apply these results to the denoising problem; this will bring us up to date with the current state-of-the-art in denoising and non-parametric estimation.

Recall that Holder spaces contain smooth functions that are well approximated with polynomials or 
piecewise polynomial functions. Holder spaces are quite large and contain many interesting signals. However, 
Holder spaces are still inadequate in many applications. Often, we encounter functions that are not smooth 
everywhere; they contain discontinuities, jumps, spikes, etc. Indeed, the "singularities" (or non-smooth 
points) can be the most interesting and informative aspects of the functions. 

Example 17.1 

Functions not smooth everywhere. 



1 This content is available online at <http://cnx.Org/content/ml6278/l.3/>. 
Figure 17.1: Example of functions not smooth everywhere. (a) 1-D Case: a spike in an otherwise smooth function. (b) 2-D Case: smoothly varying intensity except for edges.



Furthermore, functions of interest may possess different degrees of smoothness in different regions.

Example 17.2 

Functions with different degrees of smoothness. 
Figure 17.2: Examples of functions having different degrees of smoothness. (a) 1-D case (b) 2-D case (panel label: $H^\alpha$, $\alpha < 1$)



17.2 NonLinear Approximation via Trees 

Let $B^\alpha(C_\alpha)$ denote the set of all functions that are $H^\alpha(C_\alpha)$ everywhere except on a set of measure zero. To
simplify the notation, we won't explicitly identify the domain (e.g., $[0, 1]$ or $[0, 1]^d$); that will be clear from
the context.

Example 17.3: Sets of measure zero 
Figure 17.3: Sets of measure zero. (a) 1-D case: a point has measure zero in 1-D. (b) 2-D case: a smooth curve has measure zero in 2-D.



Let's consider a 1-D case first. 

Let $f \in B^\alpha(C_\alpha)$ and consider approximating $f$ by a piecewise polynomial function on a uniform
partition.

If $f$ is Hölder-$\alpha$ smooth everywhere, then by using an appropriate partition width $k^{-1}$ and
fitting degree $\lceil \alpha \rceil$ polynomials on each interval we have an approximation $f_k$ satisfying

$$ |f(t) - f_k(t)| \le C_\alpha k^{-\alpha} \qquad (17.1) $$

and so

$$ \|f - f_k\|_{L_2}^2 = O(k^{-2\alpha}). \qquad (17.2) $$
Figure 17.4: Smooth curve with a discontinuity (at $t = 1/2$).



However, if there is a discontinuity then for $t$ in the interval containing the discontinuity the
difference

$$ |f(t) - f_k(t)| \qquad (17.3) $$

will not be small.



Example 17.4 

Suppose $f$ is piecewise Lipschitz and $f_k$ is piecewise constant.

Figure 17.5

On the interval containing the discontinuity,

$$ |f(t) - f_k(t)| \approx \Delta \qquad (17.4) $$

where $\Delta$ is a constant equal to the average of $f$ on the right and left sides of the discontinuity in this interval. Since this $O(1)$ error persists over an interval of width $k^{-1}$,

$$ \|f - f_k\|_{L_2}^2 = O(k^{-1}) \qquad (17.5) $$

where $k^{-1}$ is the width of the interval. Notice this rate is quite slow.
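The slow $O(k^{-1})$ rate can be checked numerically. The sketch below is only an illustration under stated assumptions: the test function (a unit jump at $t = 1/3$), the grid size, and the helper name `uniform_piecewise_constant_error` are all hypothetical choices, and the integral is approximated by a midpoint grid average.

```python
import numpy as np

def uniform_piecewise_constant_error(f, k, grid=120_000):
    """Squared L2 error of the best piecewise-constant fit on k equal intervals of [0, 1]."""
    # Midpoint grid on [0, 1]; integrals are approximated by grid averages.
    t = (np.arange(grid) + 0.5) / grid
    y = f(t)
    # Index of the interval containing each grid point.
    idx = np.minimum((t * k).astype(int), k - 1)
    # The best constant on each interval (in L2) is the interval average.
    sums = np.bincount(idx, weights=y, minlength=k)
    counts = np.bincount(idx, minlength=k)
    fit = (sums / counts)[idx]
    return float(np.mean((y - fit) ** 2))

# Hypothetical test function: a unit jump at t = 1/3, constant elsewhere.
step = lambda t: (t >= 1.0 / 3.0).astype(float)

errors = {k: uniform_piecewise_constant_error(step, k) for k in (8, 16, 32, 64)}
for k, err in errors.items():
    print(k, err, k * err)  # k * err stays bounded, consistent with O(1/k)
```

Doubling $k$ only halves the squared error, because all of the error comes from the single interval that straddles the jump.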

This problem naturally suggests the following remedy: use very small intervals near discontinuities
and larger intervals in smooth regions. Specifically, suppose we use intervals of width $k^{-2\alpha}$ to
contain the discontinuities and intervals of width $k^{-1}$ elsewhere. Then the corresponding piecewise
polynomial approximation $f_k$ satisfies

$$ \|f - f_k\|_{L_2}^2 = O(k^{-2\alpha}). \qquad (17.6) $$

We can accomplish this need for "adaptive resolution" or "multiresolution" using recursive partitions
and trees.



17.3 Recursive Dyadic Partitions 

We discussed this idea already in our examination of classification trees. Here is the basic idea again, 
graphically. 
Figure 17.6: Complete and pruned RDPs of $[0, 1]$ along with their corresponding tree structures.



Consider a function $f \in B^\alpha(C_\alpha)$ that contains no more than $m$ points of discontinuity, and is $H^\alpha(C_\alpha)$
away from these points.

Lemma 17.1:

Consider a complete RDP with $n$ intervals. Then there exists an associated pruned RDP with
$O(k \log n)$ intervals, such that an associated piecewise degree $\lceil \alpha \rceil$ polynomial approximation $f_k$
has a squared approximation error of $O(\max(k^{-2\alpha}, n^{-1}))$.
Proof:

Assume $n > k > m$. Divide $[0, 1]$ into $k$ intervals. If $f$ is smooth on a particular interval $I$, then

$$ |f(t) - f_k(t)|^2 = O(k^{-2\alpha}) \quad \forall t \in I. \qquad (17.7) $$

In intervals that contain a discontinuity, recursively subdivide into two until the discontinuity is
contained in an interval of width $n^{-1}$. This process results in at most $\log_2 n$ additional subintervals
per discontinuity, and the squared approximation error is $O(k^{-2\alpha})$ on all of them except the $m$
intervals of width $n^{-1}$ containing the discontinuities, where the error is $O(1)$ at each point.
Thus, the overall squared $L_2$ norm is

$$ \|f - f_k\|_{L_2}^2 = O(\max(k^{-2\alpha}, n^{-1})) \qquad (17.8) $$

(the smooth intervals contribute $O(k^{-2\alpha})$ and the $m$ intervals of width $n^{-1}$ contribute $O(m/n)$),
and there are at most $k + m \log_2 n$ intervals in the partition. Since $k > m$, we can upper bound the
number of intervals by $2k \log_2 n$.

Note that if the initial complete RDP has $n \approx k^{2\alpha}$ intervals, then the squared error is $O(k^{-2\alpha})$.

Thus, we only incur a factor of $2\alpha \log k$ additional leafs and achieve the same overall approximation
error as in the $H^\alpha(C_\alpha)$ case. We will see that this is a small price to pay in order to handle
not only smooth functions, but also piecewise smooth functions.
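The pruning construction in the proof can be sketched in a few lines. This is a minimal illustration, not the notes' own code: the function name `pruned_rdp`, the jump locations, and the values of `k` and `n` are hypothetical, and `k` and `n` are assumed to be powers of two with `n >= k`.

```python
import math

def pruned_rdp(k, n, jumps):
    """Intervals of a pruned recursive dyadic partition of [0, 1).

    Start from k equal cells and keep halving any cell that contains a jump
    until the cell containing it has width <= 1/n.
    """
    cells = []
    stack = [(i / k, (i + 1) / k) for i in range(k)]
    while stack:
        a, b = stack.pop()
        has_jump = any(a < s < b for s in jumps)
        if has_jump and (b - a) > 1.0 / n:
            mid = 0.5 * (a + b)
            stack.extend([(a, mid), (mid, b)])
        else:
            cells.append((a, b))
    return sorted(cells)

k, n = 16, 1024
jumps = [1.0 / 3.0, 0.7]          # hypothetical discontinuity locations
cells = pruned_rdp(k, n, jumps)
m = len(jumps)

# Each jump adds at most log2(n) cells beyond the initial k.
print(len(cells), k + m * int(math.log2(n)))
# Total width of the cells that still contain a jump is at most m / n.
bad_width = sum(b - a for a, b in cells if any(a < s < b for s in jumps))
print(bad_width, m / n)
```

Only the cells counted in `bad_width` carry an $O(1)$ pointwise error, which is where the $n^{-1}$ term in the lemma comes from.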



17.4 Wavelet Approximations 

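Let $f \in L_2([0, 1])$; that is, $\int_0^1 f^2(t)\,dt < \infty$.

A wavelet approximation is a series of the form

$$ f = c_0 + \sum_{j \ge 0} \sum_{k=1}^{2^j} \langle f, \psi_{j,k} \rangle \psi_{j,k} \qquad (17.9) $$

where $c_0$ is a constant ($c_0 = \int_0^1 f(t)\,dt$),

$$ \langle f, \psi_{j,k} \rangle = \int_0^1 f(t)\,\psi_{j,k}(t)\,dt \qquad (17.10) $$

and the basis functions $\psi_{j,k}$ are orthonormal, oscillatory signals, each with an associated scale $2^{-j}$ and
position $k2^{-j}$. $\psi_{j,k}$ is called the wavelet at scale $2^{-j}$ and position $k2^{-j}$.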

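Example 17.5: Haar Wavelets

$$ \psi_{j,k}(t) = 2^{j/2} \left( \mathbf{1}_{\{t \in [2^{-j}(k-1),\, 2^{-j}(k-1/2)]\}} - \mathbf{1}_{\{t \in [2^{-j}(k-1/2),\, 2^{-j}k]\}} \right) \qquad (17.11) $$

Figure 17.7: Haar wavelet, taking the values $\pm 2^{j/2}$ on the interval from $(k-1)2^{-j}$ to $k2^{-j}$.

$$ \int_0^1 \psi_{j,k}(t)\,dt = 0 \qquad (17.12) $$

$$ \int_0^1 \psi_{j,k}^2(t)\,dt = \int_{(k-1)2^{-j}}^{k2^{-j}} 2^j\,dt = 1 \qquad (17.13) $$

$$ \int_0^1 \psi_{j,k}(t)\,\psi_{l,m}(t)\,dt = \delta_{j,l}\,\delta_{k,m} \qquad (17.14) $$

Note: If $f$ is constant on $[2^{-j}(k-1),\, 2^{-j}k]$, then

$$ \int_0^1 f(t)\,\psi_{j,k}(t)\,dt = 0. \qquad (17.15) $$

Suppose $f$ is piecewise constant with at most $m$ discontinuities. Let

$$ f_J = c_0 + \sum_{j=0}^{J-1} \sum_{k=1}^{2^j} \langle f, \psi_{j,k} \rangle \psi_{j,k}. \qquad (17.16) $$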


Then, $f_J$ has at most $mJ$ non-zero wavelet coefficients; i.e., $\langle f, \psi_{j,k} \rangle = 0$ for all but $mJ$ terms,
since at most one Haar wavelet at each scale senses each point of discontinuity. Said another way,
all but at most $m$ of the wavelets at each scale have support over constant regions of $f$.

$f_J$ itself will be piecewise constant with discontinuities only possibly occurring at end points of
the intervals $[2^{-J}(k-1),\, 2^{-J}k]$. Therefore, in this case

$$ \|f - f_J\|_{L_2}^2 = O(2^{-J}). \qquad (17.17) $$

Daubechies wavelets are an extension of the Haar wavelet idea. Haar wavelets have one "vanishing
moment":

$$ \int_0^1 \psi_{j,k}(t)\,dt = 0. \qquad (17.18) $$

Daubechies wavelets are "smoother" basis functions with extra vanishing moments. The
Daubechies-$N$ wavelet has $N$ vanishing moments:

$$ \int_0^1 t^l\,\psi_{j,k}(t)\,dt = 0 \quad \text{for } l = 0, 1, \ldots, N-1. \qquad (17.19) $$

The Daubechies-1 wavelet is just the Haar case.

If $f$ is a piecewise degree $< N$ polynomial with at most $m$ pieces, then using the Daubechies-$N$
wavelet system,

$$ \|f - f_J\|_{L_2}^2 = O(2^{-J}) \qquad (17.20) $$

and

$$ f_J(t) = c_0 + \sum_{j=0}^{J-1} \sum_{k=1}^{2^j} \langle f, \psi_{j,k} \rangle \psi_{j,k}(t) \qquad (17.21) $$

has at most $O(mJ)$ non-zero wavelet coefficients. $f_J$ is called the Discrete Wavelet Transform
(DWT) approximation of $f$. The key idea is the same as we saw with trees.

17.5 Sampled Data 

We can also use DWTs to analyze and represent discrete, sampled functions. Suppose

$$ \mathbf{f} = [f(1/n), f(2/n), \ldots, f(n/n)]; \qquad (17.22) $$

then we can write $\mathbf{f}$ as

$$ \mathbf{f} = c_0 + \sum_{j=0}^{\log_2 n - 1} \sum_{k=1}^{2^j} \langle \mathbf{f}, \boldsymbol{\psi}_{j,k} \rangle \boldsymbol{\psi}_{j,k} \qquad (17.23) $$

where

$$ \boldsymbol{\psi}_{j,k} = [\psi_{j,k}(1), \psi_{j,k}(2), \ldots, \psi_{j,k}(n)] \qquad (17.24) $$

is a discrete-time analog of the continuous-time wavelets we considered before. In particular,

$$ \sum_{i=1}^{n} i^l\,\psi_{j,k}(i) = 0, \quad l = 0, 1, \ldots, N-1 \qquad (17.25) $$

for the Daubechies-$N$ discrete wavelets, and

$$ \langle \mathbf{f}, \boldsymbol{\psi}_{j,k} \rangle = \mathbf{f}^T \boldsymbol{\psi}_{j,k}. \qquad (17.26) $$

Thus, we also have an analogous approximation result: if $\mathbf{f}$ consists of samples from a piecewise degree $< N$
polynomial function with a finite number $m$ of discontinuities, then $\mathbf{f}$ has $O(mJ)$ non-zero wavelet coefficients, with $J = \log_2 n$.
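The coefficient count can be illustrated in the simplest (Haar, $N = 1$) case. The following is a sketch under stated assumptions: `haar_dwt` is a hypothetical helper implementing an orthonormal discrete Haar transform by repeated pairwise averaging and differencing, and the test signal with two jumps is made up for illustration.

```python
import numpy as np

def haar_dwt(x):
    """Orthonormal discrete Haar transform of a length-2^J vector.

    Returns (c0, details) where details[j] holds the 2^j coefficients at
    scale index j (j = 0 is the coarsest scale). With this orthonormal
    scaling, c0 equals sum(x) / sqrt(len(x)).
    """
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # wavelet coefficients
        approx = (even + odd) / np.sqrt(2.0)          # coarser averages
    return approx[0], details[::-1]

n = 1024                                   # J = log2(n) = 10 scales
t = np.arange(1, n + 1) / n
# Hypothetical piecewise-constant signal with m = 2 jumps (at 1/3 and 0.7).
f = 1.0 * (t > 1 / 3) - 2.0 * (t > 0.7)

c0, details = haar_dwt(f)
nonzero = sum(int(np.count_nonzero(np.abs(d) > 1e-12)) for d in details)
J, m = int(np.log2(n)), 2
print(nonzero, m * J)                      # nonzero is at most m * J
```

Out of $n - 1 = 1023$ wavelet coefficients, only those whose support straddles a jump survive, at most one per jump per scale.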

17.6 Approximating functions with wavelets 

Suppose $f \in B^\alpha(C_\alpha)$ and has a finite number of discontinuities. Let $f_p$ denote a piecewise degree-$N$ ($N = \lceil \alpha \rceil$)
polynomial approximation to $f$ with $O(k)$ pieces: a uniform partition into $k$ equal-length intervals followed
by additional splits at the points of discontinuity.

Figure 17.8: Uniform partition with an extra break point at the discontinuity.

Then

$$ |f(t) - f_p(t)|^2 = O(k^{-2\alpha}) \quad \forall t \in [0, 1] \qquad (17.27) $$

$$ |f(i/n) - f_p(i/n)|^2 = O(k^{-2\alpha}), \quad i = 1, \ldots, n \qquad (17.28) $$

$$ \frac{1}{n} \|\mathbf{f} - \mathbf{f}_p\|_{\ell_2}^2 = O(k^{-2\alpha}) \qquad (17.29) $$

and $\mathbf{f}_p$ has $O(k \log_2 n)$ non-zero coefficients according to our previous analysis.

17.7 Wavelets in 2-D 

Suppose $f$ is a 2-D image that is piecewise polynomial:

Figure 17.9

A pruned RDP of $k$ squares decorated with polyfits gives

$$ \|f - f_k\|_{L_2}^2 = O(k^{-1}). \qquad (17.30) $$

Figure 17.10: Pruned RDP with resolution $1/k$ sidelength along the edge.

Let $\mathbf{f} = [f(i/k, j/k)]_{i,j=1}^{k}$ be the sampled image and

$$ f_n(t) = \sum_{i,j=1}^{k} f(i/k, j/k)\, \mathbf{1}_{\{t \in [(i-1)/k,\, i/k) \times [(j-1)/k,\, j/k)\}}; \qquad (17.31) $$

then

$$ \|f - f_n\|_{L_2}^2 = O(k^{-1}): \qquad (17.32) $$

there is $O(1)$ error on $k$ of the $k^2$ pixels, and near zero error elsewhere. The DWT of $\mathbf{f}$ has $O(k)$ non-zero wavelet coefficients:
$O(2^j)$ at scale $2^{-j}$, $j = 0, 1, \ldots, \log n$.
Chapter 18

Vapnik-Chervonenkis Theory 1 



18.1 Review of Past Lecture 

In our past lectures we considered collections of candidate functions $\mathcal{F}$ that were either finite or enumerable.
We then constructed penalties, usually codelengths, $c(f)$ for each candidate $f \in \mathcal{F}$, such that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$.
This allowed us to derive uniform concentration inequalities over the entire set $\mathcal{F}$ using the union bound.
However, in many cases the collection $\mathcal{F}$ may be uncountably infinite. A simple example is the collection
$\mathcal{F}$ of single-threshold classifiers in 1-d having the form

$$ f_t(x) = \mathbf{1}_{\{x \ge t\}} \qquad (18.1) $$

and their complements

$$ f_t(x) = \mathbf{1}_{\{x < t\}}. \qquad (18.2) $$

Thus, $\mathcal{F}$ contains an uncountable number of classifiers, and we cannot apply the union bound argument in
such cases.

18.2 Two Ways to Proceed 

18.2.1 Discretize or Quantize the Collection 

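Example 18.1

To quantize $\mathcal{F}$, let

$$ \mathcal{F}_q = \{ f : f(x) = \mathbf{1}_{\{x < i/q\}},\ i \in \{0, 1, \ldots, q\} \} \qquad (18.3) $$

where $q$ is a positive integer, such that for every $f \in \mathcal{F}$ there is an $f_q \in \mathcal{F}_q$ with

$$ \|f - f_q\| \le c/q \qquad (18.4) $$

if the density of $x$ is bounded by $c > 0$. $q < n^{1/2}$.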



1 This content is available online at <http://cnx.Org/content/ml6284/l.2/>. 
18.2.2 Identical Empirical Errors

Consider the fact that given only $n$ training data, many of the classifiers in such a collection may produce
identical empirical errors. Also, many $f \in \mathcal{F}$ will produce identical label assignments on the data. We will
have at most $2^n$ unique labelings.

Although $\mathcal{F}$ is uncountable, its intersections with the data are countable and bounded in number by $2^n$. For the
threshold classifiers above, the $n$ points define $n$ intervals, with 2 classifiers per interval.

The number of distinct labeling assignments that a class $\mathcal{F}$ can produce on a set of $n$ points is denoted $S(\mathcal{F}, n)$, and

$$ S(\mathcal{F}, n) \le 2^n. \qquad (18.5) $$

The VC dimension is defined through $\log S(\mathcal{F}, n)$. Specifically, $VC(\mathcal{F}) = k$, where $k$ is the largest integer such that $S(\mathcal{F}, k) = 2^k$.
Ex. For the threshold classifiers, $S(\mathcal{F}, n) = 2n$, and $2n = 2^n$ for $n = 2$ but not for any larger $n$, so $VC(\mathcal{F}) = 2$.
Ex. Consider

$$ \mathcal{F} = \{ f : f(x) = \mathbf{1}_{\{x \ge t\}} \text{ or } f(x) = \mathbf{1}_{\{x < t\}},\ t \in [0, 1] \}. \qquad (18.6) $$

Let $q$ be a positive integer and

$$ \mathcal{F}_q = \{ f : f(x) = \mathbf{1}_{\{x \ge i/q\}} \text{ or } f(x) = \mathbf{1}_{\{x < i/q\}},\ i \in \{0, 1, \ldots, q\} \} \qquad (18.7) $$

and

$$ |\mathcal{F}_q| = 2(q+1). \qquad (18.8) $$

Moreover, for any $f \in \mathcal{F}$ there exists an $f_q \in \mathcal{F}_q$ such that

$$ \int |f(x) - f_q(x)|\,dx \le \int_{(i-1)/q}^{i/q} 1\,dx = 1/q. \qquad (18.9) $$

Now suppose we have $n$ training data and suppose $f^* \in \mathcal{F}$. We know that in general, the minimum empirical
risk classifier will converge to the Bayes classifier at the rate of $n^{-1/2}$ or slower. Therefore, it is unnecessary
to drive the approximation error down faster than $n^{-1/2}$. So we can restrict our attention to $\mathcal{F}_{\sqrt{n}}$ and,
provided that the density of $x$ is bounded above, we have

$$ \min_{f \in \mathcal{F}_{\sqrt{n}}} R(f) - R(f^*) \le C \min_{f \in \mathcal{F}_{\sqrt{n}}} \int |f^*(x) - f(x)|\,dx \le C/n^{1/2}. \qquad (18.10) $$

Vapnik-Chervonenkis theory is based not on explicitly quantizing the collection of candidate functions, but
rather on recognizing that the richness of $\mathcal{F}$ is limited in a certain sense by the number of training data.
Indeed, given $n$ i.i.d. training data, there are at most $2^n$ different binary labelings. Therefore, any collection
$\mathcal{F}$ may be divided into $2^n$ subsets of classifiers that are "equivalent" with respect to the training data. In
many cases a collection may not even be capable of producing $2^n$ different labellings.

18.3 Example 

Consider $\mathcal{X} = [0, 1]$ and

$$ \mathcal{F} = \{ f : f(x) = \mathbf{1}_{\{x \ge t\}} \text{ or } f(x) = \mathbf{1}_{\{x < t\}},\ t \in [0, 1] \}. \qquad (18.11) $$

Suppose we have $n$ training data $x_1, \ldots, x_n \in [0, 1]$, where $x_i$ denotes the location of the $i$-th training point
in $[0, 1]$. Associated with each $x_i$ is a label $y_i \in \{0, 1\}$. Any classifier in $\mathcal{F}$ will label all points to the left of a
number $t \in [0, 1]$ as "1" or "0", and points to the right as "0" or "1", respectively. For $t \in [0, x_1)$, all points
are either labelled "0" or "1". For $t \in (x_1, x_2)$, $x_1$ is labelled "0" or "1" and $x_2, \ldots, x_n$ are labelled "1" or "0", and
so on. We see that there are exactly $2n$ different labellings; far less than $2^n$!
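The count of $2n$ labellings is easy to confirm by enumeration. The snippet below is a small sketch; the function name `threshold_labelings` and the sample locations are hypothetical, and the points are assumed to be distinct and strictly inside $(0, 1)$.

```python
def threshold_labelings(points):
    """All distinct label vectors produced by 1{x >= t} and 1{x < t}, t in [0, 1]."""
    xs = sorted(points)
    # One representative threshold per region: at or below the smallest point,
    # between consecutive points, and above the largest point.
    candidates = [0.0] + [0.5 * (a + b) for a, b in zip(xs, xs[1:])] + [1.0]
    labelings = set()
    for t in candidates:
        right = tuple(int(x >= t) for x in points)
        left = tuple(1 - r for r in right)   # the complementary classifier 1{x < t}
        labelings.update([right, left])
    return labelings

points = [0.12, 0.35, 0.5, 0.81, 0.93]       # hypothetical training locations
labelings = threshold_labelings(points)
print(len(labelings), 2 * len(points), 2 ** len(points))
```

For these five points the class produces 10 labelings out of the 32 that are conceivable.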
The number of different labellings that a class $\mathcal{F}$ can produce on a set of $n$ training data is a measure of
the "effective size" of $\mathcal{F}$. The Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$ is proportional to the log of the
effective size. Let $V(\mathcal{F})$ denote the VC dimension of $\mathcal{F}$, typically a constant, independent of $n$. The VC
inequality states that for all $f \in \mathcal{F}$

$$ P\left( |\hat{R}_n(f) - R(f)| > \epsilon \right) \le 8 (n+1)^{V(\mathcal{F})} e^{-n\epsilon^2/32}. \qquad (18.12) $$

This type of uniform concentration inequality can be used in a similar fashion to our use of Hoeffding's
inequality plus the union bound.

18.4 Hyperplane Classifiers 

We will go into the details of VC Theory next lecture (Chapter 19), and the remainder of this lecture will
introduce the key ideas with an example. Consider the following setup. Let $\mathcal{X} = [0, 1]^d$, $\mathcal{Y} = \{0, 1\}$. Let

$$ \mathcal{F} = \{ f : f(x) = \mathbf{1}_{\{w^T x + w_0 > 0\}} \} \qquad (18.13) $$

with $(w_0, w) \in \mathbb{R}^{d+1}$. This is the collection of all hyperplane classifiers. $\mathcal{F}$ is infinite and uncountable.
Suppose that we have $n$ training data

$$ \{x_i, y_i\}_{i=1}^{n}. \qquad (18.14) $$

There are at most $2\binom{n}{d}$ unique classifiers in $\mathcal{F}$ with respect to these data. To see this, consider $d$ arbitrary
data points $x_{i_1}, \ldots, x_{i_d}$, and let $w^T x + w_0 = 0$ be a hyperplane containing these points. To be specific, take
the hyperplane with

$$ \|w\| = 1; \qquad (18.15) $$

this hyperplane coincides with two possible classification rules:

$$ f_1(x) = \mathbf{1}_{\{w^T x + w_0 > 0\}} \qquad (18.16) $$

$$ f_2(x) = \mathbf{1}_{\{w^T x + w_0 < 0\}} \qquad (18.17) $$

Each $d$-tuple of training data produces two distinct classifiers, assuming the data are not co-linear. Thus,
there are at most $2\binom{n}{d}$ unique classifiers in $\mathcal{F}$ with respect to the training data. (All other $f \in \mathcal{F}$ produce
the same labels and empirical risk as one of the classifiers.) Let's enumerate the unique hyperplane classifiers
$f_1, \ldots, f_{2\binom{n}{d}}$, and let

$$ \hat{f}_n = \arg\min_{f \in \{f_1, \ldots, f_{2\binom{n}{d}}\}} \hat{R}_n(f) \qquad (18.18) $$

and let

$$ R^* = \inf_{f \in \mathcal{F}} R(f) \qquad (18.19) $$

and define

$$ f^* = \arg\min_{f \in \mathcal{F}} R(f). \qquad (18.20) $$

If multiple $f \in \mathcal{F}$ achieve $R^*$, pick $f^*$ to be one of them in an arbitrary fixed manner.
Theorem 18.1:

Assume that $P_X$ has a density, but that the distribution of $(x, y)$ is otherwise arbitrary. If $n \ge d$ and
$2d/n \le \epsilon \le 1$ then

$$ P\left( R(\hat{f}_n) - R(f^*) > \epsilon \right) \le e^{2d\epsilon} \left( 2\binom{n}{d} + 1 \right) e^{-n\epsilon^2/2}. \qquad (18.21) $$

Note: The assumption that $P_X$ has a density ensures that no $d+1$ points are co-planar. This in
turn guarantees that there are exactly $2\binom{n}{d}$ unique classifiers and that the $2\binom{n}{d}$ classifiers under consideration
are fully representative of all possible classifiers in $\mathcal{F}$, with respect to the data.

Proof: 

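The proof is a specialization of the basic ingredients of VC Theory to the case at hand. Here we
follow the proof in DGL '96. First we note that

$$ R(\hat{f}_n) - R(f^*) = R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \hat{R}_n(\hat{f}_n) - R(f^*) \qquad (18.22) $$

$$ \le R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \hat{R}_n(f^*) - R(f^*) + d/n \qquad (18.23) $$

since $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f) + d/n$ for any $f \in \mathcal{F}$, and so

$$ R(\hat{f}_n) - R(f^*) \le \max_{i=1,\ldots,2\binom{n}{d}} \left( R(f_i) - \hat{R}_n(f_i) \right) + \hat{R}_n(f^*) - R(f^*) + d/n. \qquad (18.24) $$

Therefore, by the union bound:

$$ P\left( R(\hat{f}_n) - R(f^*) > \epsilon \right) \qquad (18.25) $$

$$ \le \sum_{i=1}^{2\binom{n}{d}} P\left( R(f_i) - \hat{R}_n(f_i) > \epsilon/2 \right) + P\left( \hat{R}_n(f^*) - R(f^*) + d/n > \epsilon/2 \right). \qquad (18.26) $$

We can bound the second term of the above bound using Chernoff's/Hoeffding's inequality:

$$ P\left( \hat{R}_n(f^*) - R(f^*) > \epsilon/2 - d/n \right) \qquad (18.27) $$

$$ \le e^{-2n(\epsilon/2 - d/n)^2} \qquad (18.28) $$

$$ \le e^{2d\epsilon} e^{-n\epsilon^2/2}. \qquad (18.29) $$

Next, let's bound one of the terms in the summation. For example, take

$$ P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \right). \qquad (18.30) $$

Note that by symmetry all $2\binom{n}{d}$ terms will have identical bounds, since the bounds are independent
of $P_{XY}$.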
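Assume that $f_1$ is determined by the first $d$ data points $x_1, \ldots, x_d$. By the smoothing property
of expectations we can write

$$ P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \right) = E\left[ P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \mid x_1, \ldots, x_d \right) \right]. \qquad (18.31) $$

From here, we will bound the conditional probability inside the expectation. Let
$(X_1'', Y_1''), \ldots, (X_d'', Y_d'')$ be $d$ additional random samples that are independent and identically
distributed as the data $(X_1, Y_1), \ldots, (X_d, Y_d)$. $\{X_i'', Y_i''\}_{i=1}^{d}$ are often called the "ghost sample" since
they are not actually observed. They are a fictitious sample that leads to a simple bound on the conditional
probability. Define, if $i \le d$,

$$ (X_i', Y_i') = (X_i'', Y_i'') \qquad (18.32) $$

or, if $i > d$,

$$ (X_i', Y_i') = (X_i, Y_i). \qquad (18.33) $$

That is, $\{X_i', Y_i'\}_{i=1}^{n}$ agrees with our observed data on $i > d$, but the first $d$ samples are replaced
with the ghost sample. Then,

$$ P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \mid x_1, \ldots, x_d \right) \qquad (18.34) $$

$$ \le P\left( R(f_1) - \frac{1}{n} \sum_{i=d+1}^{n} \mathbf{1}_{\{f_1(x_i) \ne y_i\}} > \epsilon/2 \,\Big|\, x_1, \ldots, x_d \right) \qquad (18.35) $$

$$ \le P\left( R(f_1) - \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f_1(X_i') \ne Y_i'\}} + d/n > \epsilon/2 \,\Big|\, x_1, \ldots, x_d \right) \qquad (18.36) $$

$$ = P\left( R(f_1) - \hat{R}_n'(f_1) > \epsilon/2 - d/n \mid x_1, \ldots, x_d \right) \qquad (18.37) $$

where

$$ \hat{R}_n'(f_1) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f_1(X_i') \ne Y_i'\}}. \qquad (18.38) $$

Note that $n \hat{R}_n'(f_1)$ is binomially distributed (so that $\hat{R}_n'(f_1)$ has mean $R(f_1)$) and it is independent of
$x_1, \ldots, x_d$. Therefore,

$$ P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \mid x_1, \ldots, x_d \right) \qquad (18.39) $$

$$ \le P\left( R(f_1) - \hat{R}_n'(f_1) > \epsilon/2 - d/n \mid x_1, \ldots, x_d \right) \qquad (18.40) $$

$$ \le e^{-2n(\epsilon/2 - d/n)^2} \qquad (18.41) $$

$$ \le e^{2d\epsilon} e^{-n\epsilon^2/2}. \qquad (18.42) $$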
(18.42) 
In conclusion,

$$ P\left( R(\hat{f}_n) - R^* > \epsilon \right) \le 2\binom{n}{d} e^{2d\epsilon} e^{-n\epsilon^2/2} + e^{2d\epsilon} e^{-n\epsilon^2/2} = e^{2d\epsilon} \left( 2\binom{n}{d} + 1 \right) e^{-n\epsilon^2/2}. \qquad (18.43)\text{-}(18.46) $$

Lastly, Corollary: If $n \ge d$, then

$$ E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{ \frac{2 (d+1)(\log n + 2)}{n} }. \qquad (18.47) $$
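To get a feel for the size of these bounds, they can simply be evaluated. The sketch below assumes the tail bound has the form $e^{2d\epsilon}(2\binom{n}{d}+1)e^{-n\epsilon^2/2}$ and the corollary has the form $\sqrt{2(d+1)(\log n + 2)/n}$; the function names and the values of `n`, `d`, and `eps` are hypothetical.

```python
import math

def hyperplane_tail_bound(n, d, eps):
    """Right-hand side of (18.21): e^{2 d eps} (2 C(n, d) + 1) e^{-n eps^2 / 2}."""
    if not (n >= d and 2 * d / n <= eps <= 1):
        raise ValueError("the theorem requires n >= d and 2d/n <= eps <= 1")
    return math.exp(2 * d * eps) * (2 * math.comb(n, d) + 1) * math.exp(-n * eps ** 2 / 2)

def hyperplane_expected_excess_bound(n, d):
    """Corollary bound sqrt(2 (d + 1) (log n + 2) / n)."""
    return math.sqrt(2 * (d + 1) * (math.log(n) + 2) / n)

for n in (1_000, 10_000, 100_000):
    print(n, hyperplane_tail_bound(n, d=2, eps=0.2), hyperplane_expected_excess_bound(n, d=2))
```

The polynomial factor $\binom{n}{d}$ is quickly overwhelmed by the exponential term, which is the reason the uncountable class behaves like a finite one.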



Chapter 19 

The Vapnik-Chervonenkis Inequality 1 



19.1 The Vapnik-Chervonenkis Inequality 

The VC inequality is a powerful generalization of the bounds we obtained for the hyperplane classifier in the
previous lecture (Chapter 18). The basic idea of the proof is quite similar. Before stating the inequality,
we need to introduce the concepts of shatter coefficients and VC dimension.

19.2 Shatter Coefficients 

Let $\mathcal{A}$ be a collection of subsets of $\mathbb{R}^d$. Definition: The $n$-th shatter coefficient of $\mathcal{A}$ is defined by

$$ S_{\mathcal{A}}(n) = \max_{x_1, \ldots, x_n \in \mathbb{R}^d} \left| \{ \{x_1, \ldots, x_n\} \cap A,\ A \in \mathcal{A} \} \right|. \qquad (19.1) $$

The shatter coefficients are a measure of the richness of the collection $\mathcal{A}$. $S_{\mathcal{A}}(n)$ is the largest number of
different subsets of a set of $n$ points that can be generated by intersecting the set with elements of $\mathcal{A}$.

Example 19.1

In 1-d, let $\mathcal{A} = \{(-\infty, t],\ t \in \mathbb{R}\}$. Possible subsets of $\{x_1, \ldots, x_n\}$ generated by intersecting with sets
of the form $(-\infty, t]$ are $\{x_1, \ldots, x_n\}, \{x_1, \ldots, x_{n-1}\}, \ldots, \{x_1\}, \emptyset$. Hence $S_{\mathcal{A}}(n) = n + 1$.

Example 19.2

In 2-d, let $\mathcal{A} = \{\text{all rectangles in } \mathbb{R}^2\}$.

Consider a set $\{x_1, x_2, x_3, x_4\}$ of training points. If we arrange the four points at the corners
of a diamond shape, it is easy to see that we can find a rectangle in $\mathbb{R}^2$ to cover any subset of the
four points, i.e. $S_{\mathcal{A}}(4) = 2^4 = 16$.

Clearly, $S_{\mathcal{A}}(n) = 2^n$ for $n = 1, 2, 3$ as well.

However, for $n = 5$, $S_{\mathcal{A}}(n) < 2^5$. This is because we can always select four points such that the
rectangle which just contains those four also contains the other point. Consequently, we cannot find
a rectangle classifier which contains the four outer points and does not contain the inner point.

Note that $S_{\mathcal{A}}(n) \le 2^n$.

If $\left| \{ \{x_1, \ldots, x_n\} \cap A,\ A \in \mathcal{A} \} \right| = 2^n$ then we say that $\mathcal{A}$ shatters $x_1, \ldots, x_n$.
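Both examples can be checked by brute force on small point sets. The sketch below assumes the rectangles are axis-aligned (the text says only "all rectangles"); the function names and the point coordinates are hypothetical.

```python
from itertools import combinations

def halfline_subsets(points):
    """Distinct subsets of `points` picked out by sets of the form (-inf, t]."""
    xs = sorted(points)
    thresholds = [xs[0] - 1.0] + xs          # below all points, then at each point
    return {frozenset(x for x in points if x <= t) for t in thresholds}

def rectangle_subsets(points):
    """Distinct subsets of 2-D `points` picked out by axis-aligned rectangles."""
    found = {frozenset()}                    # a rectangle away from all points
    for r in range(1, len(points) + 1):
        for subset in combinations(points, r):
            # The bounding box is the smallest rectangle containing the subset,
            # so the subset is realizable iff the box captures no other point.
            x_lo, x_hi = min(p[0] for p in subset), max(p[0] for p in subset)
            y_lo, y_hi = min(p[1] for p in subset), max(p[1] for p in subset)
            inside = frozenset(p for p in points
                               if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
            found.add(inside)
    return found

line_points = [0.1, 0.4, 0.7, 0.9]
diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
five = diamond + [(0, 0)]

print(len(halfline_subsets(line_points)))    # n + 1 subsets for n points
print(len(rectangle_subsets(diamond)))       # all 2^4 subsets of the diamond
print(len(rectangle_subsets(five)))          # fewer than 2^5 for this layout
```

This only evaluates the inner count in (19.1) for one configuration; the shatter coefficient itself is the maximum of that count over all configurations of $n$ points.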



lr This content is available online at <http://cnx.Org/content/ml6283/l.2/>. 
19.3 VC Dimension

Definition 19.1: The VC dimension

The VC dimension $V_{\mathcal{A}}$ of a collection of sets $\mathcal{A}$ is defined as the largest integer $n$ such that $S_{\mathcal{A}}(n) = 2^n$.
Example
$\mathcal{A} = \{(-\infty, t];\ t \in \mathbb{R}\}$, $S_{\mathcal{A}}(n) = n + 1$, hence $V_{\mathcal{A}} = 1$.

Example

$\mathcal{A} = \{\text{all rectangles in } \mathbb{R}^2\}$.

$S_{\mathcal{A}}(n) = 2^n$ for $n = 1, 2, 3, 4$ and $S_{\mathcal{A}}(n) < 2^n$ for $n > 4$. Hence $V_{\mathcal{A}} = 4$.

The VC dimension provides a useful bound on the growth of the shatter coefficients.

19.4 Sauer's Lemma:

Let $\mathcal{A}$ be a collection of sets with VC dimension $V_{\mathcal{A}} < \infty$. Then for all $n$, $S_{\mathcal{A}}(n) \le \sum_{i=0}^{V_{\mathcal{A}}} \binom{n}{i}$; also, $S_{\mathcal{A}}(n) \le (n + 1)^{V_{\mathcal{A}}}$ for all $n$.
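The two bounds in Sauer's Lemma are easy to compare numerically. This is a small sketch; the helper name `sauer_bound` and the chosen values of $n$ and $V$ are illustrative only.

```python
import math

def sauer_bound(n, v):
    """Sum_{i=0}^{v} C(n, i), the Sauer bound on the n-th shatter coefficient."""
    return sum(math.comb(n, i) for i in range(v + 1))

for v in (1, 3, 4):
    for n in (5, 20, 100):
        print(v, n, sauer_bound(n, v), (n + 1) ** v, 2 ** n)
```

For $V = 1$ the sum equals $n + 1$, which matches the half-line example exactly, and for $n \le V$ the sum equals $2^n$, as it must when the points can be shattered.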

19.5 VC Dimension and Classifiers 

Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0, 1\}$. Define $\mathcal{A} = \{ \{x : f(x) = 1\} \times \{0\} \cup \{x : f(x) = 0\} \times \{1\},\ f \in \mathcal{F} \}$.
In words, this is the collection of subsets of $\mathcal{X} \times \mathcal{Y}$ for which an $f \in \mathcal{F}$ maps the features
$x$ to a label opposite of $y$. The size of $\mathcal{A}$ expresses the richness of $\mathcal{F}$. The larger $\mathcal{A}$ is, the more likely it is
that there exists an $f \in \mathcal{F}$ for which $R(f) = P(f(X) \ne Y)$ is close to the Bayes risk $R^* = P(f^*(X) \ne Y)$,
where $f^*$ is the Bayes classifier. The $n$-th shatter coefficient of $\mathcal{F}$ is defined as $S_{\mathcal{F}}(n) = S_{\mathcal{A}}(n)$ and the VC
dimension of $\mathcal{F}$ is defined as $V_{\mathcal{F}} = V_{\mathcal{A}}$.

Example 19.3 

Linear (hyperplane) classifiers in $\mathbb{R}^d$.

Consider $d = 2$. Let $n$ be the number of training points, and let $\mathcal{A}$ be as above. When $n = 1$,
by using linear classifiers in $\mathbb{R}^2$, it is easy to see that we can assign 1 to all possible
subsets $\{\{x_1\}, \emptyset\}$ and to their complements. Hence $S_{\mathcal{F}}(1) = 2$.

When $n = 2$, we can also assign 1 to all possible subsets $\{\{x_1, x_2\}, \{x_1\}, \{x_2\}, \emptyset\}$ and to their
complements, and vice versa. Hence $S_{\mathcal{F}}(2) = 4 = 2^2$.

When $n = 3$, we can arrange the points $x_1, x_2, x_3$ (non-colinear) so that the set of linear
classifiers shatters the three points, hence $S_{\mathcal{F}}(3) = 8 = 2^3$.

When $n = 4$, no matter where the points $x_1, x_2, x_3, x_4$ are and what the designated binary values
$y_1, y_2, y_3, y_4$ are, it is clear that $\mathcal{A}$ does not shatter the four points. To see the claim, first observe
that the four points will form a 4-gon (if the four points are co-linear, or if three of the points are
co-linear, then clearly linear classifiers cannot shatter the points). The two points that belong to
the same diagonal line form a group, giving 2 groups, and no linear classifier can assign different values to the 2
groups. Hence $S_{\mathcal{F}}(4) < 16 = 2^4$ and $V_{\mathcal{F}} = 3$.

We state here without proving it that in general the class of linear classifiers in $\mathbb{R}^d$ has $V_{\mathcal{F}} = d + 1$.
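The $d = 2$ case can be explored numerically. The sketch below is an illustration under stated assumptions, not an exhaustive search: directions are sampled on a grid of angles (adequate for the two small point sets used here), and the function name and coordinates are hypothetical.

```python
import math

def linear_labelings(points, n_angles=360):
    """Labelings of 2-D `points` produced by classifiers 1{w.x > t} and 1{w.x < t}.

    Directions w are taken on a grid of angles, so this is a numerical sketch
    rather than an exhaustive search.
    """
    labelings = set()
    for a in range(n_angles):
        theta = 2.0 * math.pi * a / n_angles
        w = (math.cos(theta), math.sin(theta))
        proj = [w[0] * x + w[1] * y for x, y in points]
        levels = sorted(set(proj))
        # Thresholds strictly between distinct projections, plus one on each side.
        cuts = [levels[0] - 1.0] + [0.5 * (u + v) for u, v in zip(levels, levels[1:])]
        cuts.append(levels[-1] + 1.0)
        for t in cuts:
            labelings.add(tuple(int(p > t) for p in proj))
            labelings.add(tuple(int(p < t) for p in proj))
    return labelings

triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (0, 1), (1, 1)]

print(len(linear_labelings(triangle)), 2 ** 3)   # all 8 labelings are found
print(len(linear_labelings(square)), 2 ** 4)     # the two "diagonal" labelings are missing
```

The two labelings that never appear for the square are exactly the ones that put the two diagonals in different groups.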
19.6 The VC Inequality

Let $X_1, \ldots, X_n$ be i.i.d. $\mathbb{R}^d$-valued random variables. Denote the common distribution of $X_i$, $1 \le i \le n$,
by $\mu(A) = P(X_1 \in A)$ for any subset $A \subset \mathbb{R}^d$. Similarly, define the empirical distribution $\mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in A\}}$.

Theorem 19.1: VC '71

For any probability measure $\mu$ and collection of subsets $\mathcal{A}$, and for any $\epsilon > 0$,

$$ P\left( \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \epsilon \right) \le 8 S_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32} \qquad (19.2) $$

and

$$ E\left[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \right] \le 2 \sqrt{ \frac{\log\left( 2 S_{\mathcal{A}}(n) \right)}{n} }. \qquad (19.3) $$

Before giving a proof of the theorem, we present a corollary.

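Corollary 19.1:

Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0, 1\}$ with VC dimension $V_{\mathcal{F}} < \infty$. Let
$R(f) = P(f(X) \ne Y)$ and $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}}$, where $(X_i, Y_i)$, $1 \le i \le n$, are i.i.d. with joint
distribution $P_{XY}$.
Define

$$ \hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f). $$

Then

$$ E\left[ R(\hat{f}_n) \right] - \inf_{f \in \mathcal{F}} R(f) \le 4 \sqrt{ \frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n} }. \qquad (19.4) $$

Proof:

Let $\mathcal{A} = \{ \{x : f(x) = 1\} \times \{0\} \cup \{x : f(x) = 0\} \times \{1\},\ f \in \mathcal{F} \}$.
Note that

$$ P(f(X) \ne Y) = P((X, Y) \in A) := \mu(A) \qquad (19.5) $$

where $A = \{x : f(x) = 1\} \times \{0\} \cup \{x : f(x) = 0\} \times \{1\}$.
Similarly,

$$ \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{(X_i, Y_i) \in A\}} := \mu_n(A). \qquad (19.6) $$

Therefore, according to the VC theorem,

$$ E\left[ \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \right] = E\left[ \sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \right] \le 2 \sqrt{ \frac{\log\left( 2 S_{\mathcal{A}}(n) \right)}{n} } = 2 \sqrt{ \frac{\log\left( 2 S_{\mathcal{F}}(n) \right)}{n} }. \qquad (19.7) $$

Since $V_{\mathcal{F}} < \infty$, $S_{\mathcal{F}}(n) \le (n + 1)^{V_{\mathcal{F}}}$ and

$$ E\left[ \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \right] \le 2 \sqrt{ \frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n} }. \qquad (19.8) $$

Next, note that

$$ R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) = R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \hat{R}_n(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) $$
$$ = R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \sup_{f \in \mathcal{F}} \left( \hat{R}_n(\hat{f}_n) - R(f) \right) $$
$$ \le R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \sup_{f \in \mathcal{F}} \left( \hat{R}_n(f) - R(f) \right) $$
$$ \le 2 \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|, \qquad (19.9) $$

where the third line uses $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f)$ for every $f \in \mathcal{F}$.

Therefore,

$$ E\left[ R(\hat{f}_n) \right] - \inf_{f \in \mathcal{F}} R(f) \le 2 E\left[ \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \right] \le 4 \sqrt{ \frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n} }. \qquad (19.10) $$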

Chapter 20 

Applications of VC Bound 1 

20.1 Linear Classifiers 

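Suppose $\mathcal{F} = \{\text{linear classifiers in } \mathbb{R}^d\}$; then we have

$$ V_{\mathcal{F}} = d + 1, \quad \hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) \qquad (20.1) $$

$$ E\left[ R(\hat{f}_n) \right] - \inf_{f \in \mathcal{F}} R(f) \le 4 \sqrt{ \frac{(d+1) \log(n+1) + \log 2}{n} }. \qquad (20.2) $$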


20.2 Generalized Linear Classifiers 

Normally, we have a feature vector $X \in \mathbb{R}^d$. A hyperplane in $\mathbb{R}^d$ provides a linear classifier in $\mathbb{R}^d$. Nonlinear
classifiers can be obtained by a straightforward generalization.

Let $\phi_1, \ldots, \phi_{d'}$, $d' \ge d$, be a collection of functions mapping $\mathbb{R}^d \to \mathbb{R}$. These functions, applied to a
feature $X \in \mathbb{R}^d$, produce a generalized set of features, $\phi = (\phi_1(X), \phi_2(X), \ldots, \phi_{d'}(X))$. For example, if
$X = (x_1, x_2)$, then we could consider $d' = 5$ and $\phi = (x_1, x_2, x_1 x_2, x_1^2, x_2^2) \in \mathbb{R}^5$. We can then construct a
linear classifier in the higher dimensional generalized feature space $\mathbb{R}^{d'}$.

The VC bounds immediately extend to this case, and we have for $\mathcal{F}' = \{\text{generalized linear classifiers
based on maps } \phi : \mathbb{R}^d \to \mathbb{R}^{d'}\}$,

$$ E\left[ R(\hat{f}_n) \right] - \inf_{f \in \mathcal{F}'} R(f) \le 4 \sqrt{ \frac{(d'+1) \log(n+1) + \log 2}{n} }. \qquad (20.3) $$

20.3 Half-Space Classifiers 



Theorem 20.1: Steele '75, Dudley '78

Let $\mathcal{G}$ be a finite-dimensional vector space of real-valued functions on $\mathbb{R}^d$. The class of sets
$\mathcal{A} = \{ \{x : g(x) \ge 0\} : g \in \mathcal{G} \}$ has VC dimension $\le \dim(\mathcal{G})$.



1 This content is available online at <http://cnx.Org/content/ml6262/l.2/>. 
Proof:

It is sufficient to show that no set of $n = \dim(\mathcal{G}) + 1$ points can be shattered by $\mathcal{A}$. Take any $n$
points and for each $g \in \mathcal{G}$, define the vector $V_g = (g(x_1), \ldots, g(x_n))$.

The set $\{V_g : g \in \mathcal{G}\}$ is a linear subspace of $\mathbb{R}^n$ of dimension $\le \dim(\mathcal{G}) = n - 1$. Therefore,
there exists a non-zero vector $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^{n} \alpha_i g(x_i) = 0$ for all $g \in \mathcal{G}$. We can assume
that at least one of these $\alpha_i$ is negative (if all are non-negative, just negate the sum). We can then
re-arrange this expression as $\sum_{i : \alpha_i \ge 0} \alpha_i g(x_i) = \sum_{i : \alpha_i < 0} -\alpha_i g(x_i)$.

Now suppose that there exists a $g \in \mathcal{G}$ such that the set $\{x : g(x) \ge 0\}$ selects precisely the $x_i$
on the left-hand side above. Then all terms on the left are non-negative, while every term on the
right is strictly negative (there $g(x_i) < 0$ and $-\alpha_i > 0$). Since at least one $\alpha_i$ is negative, the right-hand side is strictly negative, and this is a contradiction. Therefore, $x_1, \ldots, x_n$ cannot be
shattered by sets in $\{ \{x : g(x) \ge 0\},\ g \in \mathcal{G} \}$.

Example 

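Consider half-spaces in $\mathbb{R}^d$ of the form $A = \{x \in \mathbb{R}^d : x_i \ge b\}$, $i \in \{1, \ldots, d\}$, $b \in \mathbb{R}$. Each
half-space can be described by

$$ g(x) = [0, \ldots, 0, 1, 0, \ldots, 0]\,[x_1, \ldots, x_d]^T - b \qquad (20.4) $$

with the 1 in position $i$, so that

$$ \dim(\mathcal{G}) = d + 1, \quad V_{\mathcal{A}} \le d + 1. \qquad (20.5) $$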



20.4 Tree Classifiers 



Let

$$ \mathcal{T}_k = \{\text{recursive rectangular partitions of } \mathbb{R}^d \text{ with } k + 1 \text{ cells}\}. \qquad (20.6) $$

Let $T \in \mathcal{T}_k$. Each cell of $T$ results from splitting a rectangular region into two smaller rectangles parallel to
one of the coordinate axes.

Example 20.1

$T \in \mathcal{T}_3$, $d = 2$.

Each additional split is analogous to a half-space set. Therefore, each additional split can
potentially shatter $d + 1$ points. This implies that

$$ V_{\mathcal{T}_k} \le (d + 1)k. \qquad (20.7) $$

Example 20.2

$d = 1$.

$k = 1$ split shatters two points.

$k = 2$ splits shatter three points $< 4$.
20.5 VC Bound for Tree Classifiers

$$ \mathcal{F}_k = \{\text{tree classifiers with } k + 1 \text{ leafs on } \mathbb{R}^d\} \qquad (20.8) $$

$$ E\left[ R(\hat{f}_n) \right] - \inf_{f \in \mathcal{F}_k} R(f) \le 4 \sqrt{ \frac{(d+1)k \log(n+1) + \log 2}{n} }. \qquad (20.9) $$


Exercise 20.1 (Solution on p. 143.) 

How can we decide what dimension to choose for a generalized linear classifier? 
How many leafs should be used for a classification tree? 

20.6 Structural Risk Minimization (SRM) 

SRM is simply complexity regularization using VC type bounds in place of Chernoff's bound or other
concentration inequalities.

The basic idea is to consider a sequence of sets of classifiers $\mathcal{F}_1, \mathcal{F}_2, \ldots$ of increasing VC dimensions
$V_{\mathcal{F}_1} \le V_{\mathcal{F}_2} \le \cdots$. Then for each $k = 1, 2, \ldots$ we find the minimum empirical risk classifier

$$ \hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f) \qquad (20.10) $$

and then select the final classifier according to

$$ \hat{k} = \arg\min_{k \ge 1} \left\{ \hat{R}_n\left( \hat{f}_n^{(k)} \right) + \sqrt{ \frac{32 V_{\mathcal{F}_k} (\log n + 1)}{n} } \right\} \qquad (20.11) $$

and $\hat{f}_n = \hat{f}_n^{(\hat{k})}$ is the final choice.
The basic rationale is that we know

$$ E\left[ R\left( \hat{f}_n^{(k)} \right) \right] - \inf_{f \in \mathcal{F}_k} R(f) \le C \sqrt{ \frac{V_{\mathcal{F}_k} \log n}{n} } \qquad (20.12) $$

where $C$ is a constant.
The end result is that

$$ E\left[ R(\hat{f}_n) \right] \le \min_{k \ge 1} \left\{ \min_{f \in \mathcal{F}_k} R(f) + 16 \sqrt{ \frac{V_{\mathcal{F}_k} \log n + 4}{2n} } \right\} \qquad (20.13) $$

analogous to our previous complexity regularization results, except that codelengths are replaced by VC
dimensions.

In order to prove the result we use the VC probability concentration bound and assume that $\Delta =
\sum_{k \ge 1} e^{-V_{\mathcal{F}_k}} < \infty$. This enables a union bounding argument and leads to a risk bound of the form given above.
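The selection rule can be sketched as a few lines of code. This is a minimal illustration assuming the penalty has the form $\sqrt{32 V_{\mathcal{F}_k}(\log n + 1)/n}$; the function name `srm_select` and all numerical values are hypothetical, and the per-class empirical risk minimization is taken as already done.

```python
import math

def srm_select(empirical_risks, vc_dims, n):
    """Pick the class index minimizing empirical risk + sqrt(32 V (log n + 1) / n).

    empirical_risks[k] is the minimum empirical risk over the k-th class and
    vc_dims[k] is (an upper bound on) its VC dimension.
    """
    def penalized(k):
        return empirical_risks[k] + math.sqrt(32.0 * vc_dims[k] * (math.log(n) + 1.0) / n)
    return min(range(len(empirical_risks)), key=penalized)

# Hypothetical numbers: richer classes fit the training data better.
risks = [0.30, 0.18, 0.15, 0.14]
vc_dims = [3, 6, 9, 12]

print(srm_select(risks, vc_dims, n=10**4))   # small n: the penalty dominates
print(srm_select(risks, vc_dims, n=10**7))   # large n: a richer class is chosen
```

With few samples the penalty pushes the choice toward the simplest class; as $n$ grows the penalty shrinks and richer classes become admissible.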

20.7 Key Point of VC Theory 

Complexity of classes depends on richness (shattering capability) relative to a set of n arbitrary points. This 
allows us to effectively "quantize" collections of functions in a slightly data-dependent manner. 
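20.8 Application to Trees

Let

$$ \mathcal{F}_k = \{k \text{ leaf decision trees in } \mathbb{R}^d\}, \quad V_{\mathcal{F}_k} \le (d+1)(k-1), \qquad (20.14) $$

$$ \hat{f}_n^{(k)} = \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f), \qquad (20.15) $$

$$ \hat{k} = \arg\min_{k \ge 1} \left\{ \min_{f \in \mathcal{F}_k} \hat{R}_n(f) + \sqrt{ \frac{32 (d+1)(k-1)(\log n + 1)}{n} } \right\}. \qquad (20.16) $$

Then

$$ \hat{f}_n = \hat{f}_n^{(\hat{k})} \qquad (20.17) $$

satisfies

$$ E\left[ R(\hat{f}_n) \right] \le \min_{k \ge 1} \left\{ \min_{f \in \mathcal{F}_k} R(f) + 16 \sqrt{ \frac{(d+1)(k-1) \log n + 4}{2n} } \right\}; \qquad (20.18) $$

compare with

$$ E\left[ R(\hat{f}_n) \right] \le \min_{k \ge 1} \left\{ \min_{f \in \{\text{dyadic } k \text{ leaf trees}\}} R(f) + \sqrt{ \frac{(3k-1) \log 2 + \frac{1}{2} \log n}{2n} } \right\} \qquad (20.19) $$

from Lecture 11 (Chapter 12).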
Solutions to Exercises in Chapter 20

Solution to Exercise 20.1 (p. 141)

Complexity Regularization using VC bounds!

Chapter 21

Lower Performance Bounds for 
Estimators 1 

21.1 Lower Performance Bounds 

In other modules, estimators/predictors are analyzed in order to obtain upper bounds on their performance.
These bounds are of the form:

$$ \sup_{f \in \mathcal{F}} E\left[ d(\hat{f}_n, f) \right] \le C n^{-\gamma} \qquad (21.1) $$

where $\gamma > 0$. We would like to know if these bounds are tight, in the sense that there is no other estimator
that is significantly better. To answer this, we need lower bounds like

$$ \inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} E\left[ d(\hat{f}_n, f) \right] \ge c n^{-\gamma}. \qquad (21.2) $$

We assume we have the following ingredients:

* Class of models, $\mathcal{F} \subseteq \mathcal{S}$. $\mathcal{F}$ is a class of models containing the "true" model and is a subset of some bigger
class $\mathcal{S}$. E.g. $\mathcal{F}$ could be the class of Lipschitz density functions or distributions $P_{XY}$ satisfying the
box-counting condition.

* An observation model, $P_f$, indexed by $f \in \mathcal{F}$. $P_f$ denotes the distribution of the data under model $f$.
E.g. in regression and classification, this is the distribution of $Z = (X_1, Y_1, \ldots, X_n, Y_n) \in \mathcal{Z}$. We will
assume that $P_f$ is a probability measure on the measurable space $(\mathcal{Z}, \mathcal{B})$.

* A performance metric $d(\cdot, \cdot) \ge 0$. If you have a model estimate $\hat{f}_n$, then the performance of that model
estimate relative to the true model $f$ is $d(\hat{f}_n, f)$. E.g.

$$ \text{Regression: } d(\hat{f}_n, f) = \|\hat{f}_n - f\|_2 = \left( \int \left( \hat{f}_n(x) - f(x) \right)^2 dx \right)^{1/2} \qquad (21.3) $$

$$ \text{Classification: } d(\hat{f}_n, f) = R(\hat{G}_n) - R^* = \int_{\hat{G}_n \Delta G^*} |2\eta(x) - 1|\, dP_X(x) \qquad (21.4) $$




1 This content is available online at <http://cnx.org/content/m17357/1.3/>.
As before, we are interested in the risk of a learning rule, in particular the maximal risk given as:

$$\sup_{f \in \mathcal{F}} E_f\left[ d(\hat{f}_n, f) \right] = \sup_{f \in \mathcal{F}} \int d(\hat{f}_n(Z), f) \, dP_f(Z) \tag{21.5}$$

where $\hat{f}_n$ is a function of the observations $Z$ and $E_f$ denotes the expectation with respect to $P_f$.
The main goal is to get results of the form

$$\mathcal{R}_n^* = \inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} E_f\left[ d(\hat{f}_n, f) \right] \ge c s_n, \tag{21.6}$$

where $c > 0$ and $s_n \to 0$ as $n \to \infty$. The inf is taken over all estimators, i.e. all measurable functions $\hat{f}_n : \mathcal{Z} \to \mathcal{S}$.

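Suppose we have shown that

$$\liminf_{n \to \infty} s_n^{-1} \mathcal{R}_n^* \ge c > 0 \qquad \text{(a lower bound)} \tag{21.7}$$

and also that for a particular estimator $\hat{f}_n$

$$\limsup_{n \to \infty} s_n^{-1} \sup_{f \in \mathcal{F}} E_f\left[ d(\hat{f}_n, f) \right] \le C \tag{21.8}$$

$$\Longrightarrow \; \limsup_{n \to \infty} s_n^{-1} \mathcal{R}_n^* \le C. \tag{21.9}$$

Then we say that $s_n$ is the optimal rate of convergence for this problem and that $\hat{f}_n$ attains that rate.

note: Two rates of convergence $\psi_n$ and $\psi_n'$ are equivalent, i.e. $\psi_n \equiv \psi_n'$, iff

$$0 < \liminf_{n \to \infty} \frac{\psi_n}{\psi_n'} \le \limsup_{n \to \infty} \frac{\psi_n}{\psi_n'} < \infty \tag{21.10}$$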



21.1.1 General Reduction Scheme 

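Instead of directly bounding the expected performance, we are going to prove stronger probability bounds of the form

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( d(\hat{f}_n, f) \ge s_n \right) \ge c > 0 \tag{21.11}$$

These bounds can be readily converted to expected performance bounds using Markov's inequality:

$$P_f\left( d(\hat{f}_n, f) \ge s_n \right) \le \frac{E_f\left[ d(\hat{f}_n, f) \right]}{s_n} \tag{21.12}$$

Therefore it follows:

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} E_f\left[ d(\hat{f}_n, f) \right] \ge \inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} s_n P_f\left( d(\hat{f}_n, f) \ge s_n \right) \ge c s_n \tag{21.13}$$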
21.1.1.1 First Reduction Step

Reduce the original problem to an easier one by replacing the larger class $\mathcal{F}$ with a smaller finite class $\{f_0, \dots, f_M\} \subseteq \mathcal{F}$. Observe that

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( d(\hat{f}_n, f) \ge s_n \right) \ge \inf_{\hat{f}_n} \sup_{f \in \{f_0, \dots, f_M\}} P_f\left( d(\hat{f}_n, f) \ge s_n \right) \tag{21.14}$$

The key idea is to choose a finite collection of models such that the resulting problem is as hard as the original; otherwise the lower bound will not be tight.

21.1.1.2 Second Reduction Step 

Next, we reduce the problem to a hypothesis test. Ideally, we would like to have something like

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( d(\hat{f}_n, f) \ge s_n \right) \ge \inf_{h_n} \max_{j \in \{0, \dots, M\}} P_{f_j}\left( h_n(Z) \ne j \right) \tag{21.15}$$

The inf is over all measurable test functions

$$h_n : \mathcal{Z} \to \{0, \dots, M\} \tag{21.16}$$

and $P_{f_j}(h_n(Z) \ne j)$ denotes the probability that, after observing the data, the test infers the wrong hypothesis.

This might not always be true or easy to show, but in certain scenarios it can be done. Suppose $d(\cdot, \cdot)$ is a semi-distance, i.e. it satisfies

(i): $d(f, g) = d(g, f) \ge 0$ (Symmetric)
(ii): $d(f, f) = 0$ (21.17)
(iii): $d(f, g) \le d(h, f) + d(h, g)$ (Triangle inequality)

E.g. with $f, g : \mathbb{R}^d \to \mathbb{R}$, $d(f, g) = \|f - g\|_2$.

Lemma 21.1:

Suppose $d(\cdot, \cdot)$ is a semi-distance. Also suppose that we have constructed $f_0, \dots, f_M$ s.t. $d(f_j, f_k) \ge 2 s_n$, $\forall j \ne k$. Take any estimator $\hat{f}_n$ and define the test $\Psi^* \circ \hat{f}_n : \mathcal{Z} \to \{0, \dots, M\}$ as

$$\Psi^*(\hat{f}_n) = \arg\min_{j} d(\hat{f}_n, f_j) \tag{21.18}$$

Then $\Psi^*(\hat{f}_n) \ne j$ implies $d(\hat{f}_n, f_j) \ge s_n$.

Suppose $\Psi^*(\hat{f}_n) \ne j$. Then $\exists k \ne j : d(\hat{f}_n, f_k) \le d(\hat{f}_n, f_j)$. Now

$$2 s_n \le d(f_j, f_k) \le d(\hat{f}_n, f_j) + d(\hat{f}_n, f_k) \le 2 d(\hat{f}_n, f_j) \tag{21.19}$$

$$\Longrightarrow \; d(\hat{f}_n, f_j) \ge s_n \tag{21.20}$$
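The minimum-distance test of Lemma 21.1 can be sketched numerically. The following is only an illustration under stated assumptions: the three hypotheses, the grid discretization of the $L_2$ distance, and the random "estimates" are all hypothetical choices, not part of the original notes.

```python
import numpy as np

def l2_dist(f, g):
    # Discretized L2([0,1]) distance between functions sampled on a uniform grid.
    return float(np.sqrt(np.mean((f - g) ** 2)))

def min_distance_test(f_hat, hypotheses):
    # Index of the hypothesis closest to the estimate (the test Psi* of the lemma).
    return int(np.argmin([l2_dist(f_hat, f_j) for f_j in hypotheses]))

grid = np.linspace(0.0, 1.0, 201)
# Hypothetical hypotheses f_0, f_1, f_2, chosen only for illustration.
hypotheses = [np.zeros_like(grid), 0.5 * np.ones_like(grid), grid.copy()]
M = len(hypotheses) - 1

# Largest s_n for which d(f_j, f_k) >= 2 s_n holds for all j != k.
s_n = min(l2_dist(hypotheses[j], hypotheses[k])
          for j in range(M + 1) for k in range(j + 1, M + 1)) / 2

rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    a, b = rng.uniform(-1.0, 1.0, size=2)
    f_hat = a + b * grid                      # an arbitrary "estimate"
    chosen = min_distance_test(f_hat, hypotheses)
    for j in range(M + 1):
        # Lemma: if the test does not pick j, then d(f_hat, f_j) >= s_n.
        if j != chosen and l2_dist(f_hat, hypotheses[j]) < s_n - 1e-12:
            violations += 1
print("violations of the lemma:", violations)
```

Because the discretized distance is itself a semi-distance on the sampled vectors, the implication of the lemma should hold for every draw, whatever the estimate is.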



The previous lemma implies that

$$P_{f_j}\left( d(\hat{f}_n, f_j) \ge s_n \right) \ge P_{f_j}\left( \Psi^*(\hat{f}_n) \ne j \right) \tag{21.21}$$

Therefore,

$$\begin{aligned}
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( d(\hat{f}_n, f) \ge s_n \right)
&\ge \inf_{\hat{f}_n} \max_{f \in \{f_0, \dots, f_M\}} P_f\left( d(\hat{f}_n, f) \ge s_n \right) \\
&\ge \inf_{\hat{f}_n} \max_{j \in \{0, \dots, M\}} P_{f_j}\left( \Psi^*(\hat{f}_n) \ne j \right) \\
&\ge \inf_{h_n} \max_{j \in \{0, \dots, M\}} P_{f_j}\left( h_n \ne j \right) =: P_{e,M}
\end{aligned} \tag{21.22}$$

The third step follows since we are replacing the class of tests defined by $\Psi^*(\hat{f}_n)$ by a larger class of ALL possible tests $h_n$, and hence the inf taken over the larger class is smaller.
Now our goal throughout is going to be to find lower bounds for $P_{e,M}$.

So we need to construct $f_0, \dots, f_M$ s.t. $d(f_j, f_k) \ge 2 s_n$, $j \ne k$, and $P_{e,M} \ge c > 0$. Observe that this requires careful construction, since the first condition necessitates that $f_j$ and $f_k$ are far from each other, while the second condition requires that $f_j$ and $f_k$ are close enough that it is hard to distinguish them based on a given sample of data, so that the probability of error $P_{e,M}$ is bounded away from 0.

We now try to lower bound the probability of error $P_{e,M}$. We first consider the case $M = 1$, corresponding to binary hypothesis testing.

$M = 1$: Let $P_0$ and $P_1$ denote the two probability measures, i.e. distributions of the data under models 0 and 1. Clearly if $P_0$ and $P_1$ are very "close", then it is hard to distinguish the two hypotheses, and so $P_{e,1}$ is large.

A natural measure of distance between probability measures is the total variation, defined as:

$$V(P_0, P_1) = \sup_A |P_0(A) - P_1(A)| = \sup_A \left| \int_A \left( p_0(Z) - p_1(Z) \right) d\nu(Z) \right| \tag{21.23}$$

where $p_0$ and $p_1$ are the densities of $P_0$ and $P_1$ with respect to a common dominating measure $\nu$ and $A$ is any (measurable) subset of the domain. We will lower bound the probability of error $P_{e,1}$ using the total variation distance. But first, we establish the following lemma.

Lemma 21.2: Scheffé's lemma

$$V(P_0, P_1) = \frac{1}{2} \int |p_0(Z) - p_1(Z)| \, d\nu(Z) = \frac{1}{2} \int |p_0 - p_1| = 1 - \int \min(p_0, p_1) \tag{21.24}$$

Recall the definition of the total variation distance:

$$V(P_0, P_1) = \sup_A \left| \int_A (p_0 - p_1) \right| \tag{21.25}$$

Observe that the set $A$ maximizing the right hand side is given by either $\{Z \in \mathcal{Z} : p_0(Z) \ge p_1(Z)\}$ or $\{Z \in \mathcal{Z} : p_1(Z) \ge p_0(Z)\}$.

Let us pick $A_0 = \{Z \in \mathcal{Z} : p_0(Z) \ge p_1(Z)\}$. Then

$$V(P_0, P_1) = \int_{A_0} (p_0 - p_1) = -\int_{A_0^c} (p_0 - p_1) = \frac{1}{2} \int |p_0 - p_1| \tag{21.26}$$

For the second part, notice that

$$p_0(Z) - \min(p_0(Z), p_1(Z)) = \begin{cases} 0 & \text{if } p_0(Z) \le p_1(Z) \\ p_0(Z) - p_1(Z) & \text{if } p_0(Z) \ge p_1(Z) \end{cases} \tag{21.27}$$

Now consider

$$1 - \int \min(p_0, p_1) = \int \left[ p_0(Z) - \min(p_0(Z), p_1(Z)) \right] = \int_{A_0} \left( p_0(Z) - p_1(Z) \right) d\nu(Z) = V(P_0, P_1) \tag{21.28}$$
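Scheffé's lemma can be checked numerically on a small discrete example. The sketch below assumes two randomly drawn distributions on six points (a hypothetical choice, with counting measure as the dominating measure) and compares the brute-force supremum over all subsets with the two closed forms of the lemma.

```python
from itertools import combinations
import numpy as np

def tv_brute_force(p0, p1):
    # sup over all non-empty subsets A of the support of |P0(A) - P1(A)|
    # (the empty set contributes 0, so it can be skipped).
    k = len(p0)
    best = 0.0
    for r in range(1, k + 1):
        for A in combinations(range(k), r):
            best = max(best, abs(sum(p0[i] - p1[i] for i in A)))
    return best

rng = np.random.default_rng(1)
p0 = rng.dirichlet(np.ones(6))   # hypothetical discrete distribution P_0
p1 = rng.dirichlet(np.ones(6))   # hypothetical discrete distribution P_1

tv_sup = tv_brute_force(p0, p1)               # definition (sup over sets)
tv_l1 = 0.5 * np.abs(p0 - p1).sum()           # (1/2) * integral of |p0 - p1|
tv_min = 1.0 - np.minimum(p0, p1).sum()       # 1 - integral of min(p0, p1)
print(tv_sup, tv_l1, tv_min)
```

All three printed numbers should agree up to floating-point error.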

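We are now ready to tackle the lower bound on $P_{e,1}$. In this case, we consider all tests $h_n(Z) : \mathcal{Z} \to \{0, 1\}$. Equivalently, we can define $h_n(Z) = 1_A(Z)$, where $A$ is any subset of the domain.

$$\begin{aligned}
P_{e,1} = \inf_{h_n} \max_{j \in \{0,1\}} P_j(h_n \ne j)
&\ge \inf_{h_n} \frac{1}{2} \left( P_0(h_n \ne 0) + P_1(h_n \ne 1) \right) \\
&= \frac{1}{2} \inf_A \left[ P_0(1_A(Z) \ne 0) + P_1(1_A(Z) \ne 1) \right] \\
&= \frac{1}{2} \inf_A \left[ P_0(A) + P_1(A^c) \right] \\
&= \frac{1}{2} \inf_A \left[ 1 - \left( P_1(A) - P_0(A) \right) \right] \\
&= \frac{1}{2} \left( 1 - V(P_0, P_1) \right)
\end{aligned} \tag{21.29}$$

So if $P_0$ is close to $P_1$, then $V(P_0, P_1)$ is small and the probability of error $P_{e,1}$ is large.

This is interesting, but unfortunately, it is hard to work with total variation, especially for multivariate distributions. Bounds involving the Kullback-Leibler divergence are much more convenient.

$$K(P_1 \| P_0) = \int \log \frac{p_1(Z)}{p_0(Z)} \, p_1(Z) \, d\nu(Z) = \int \log \frac{p_1}{p_0} \, p_1 \tag{21.30}$$

The following Lemma relates total variation, affinity and KL divergence, where $A(P_0, P_1) = \int \sqrt{p_0 p_1}$ denotes the affinity.

Lemma 21.3:

$$1 - V(P_0, P_1) \ge \frac{1}{2} A^2(P_0, P_1) \ge \frac{1}{2} \exp\left( -K(P_1 \| P_0) \right)$$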
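For the first inequality,

$$\begin{aligned}
A^2(P_0, P_1) &= \left( \int \sqrt{p_0 p_1} \right)^2 \\
&= \left( \int \sqrt{\min(p_0, p_1) \max(p_0, p_1)} \right)^2 \\
&= \left( \int \sqrt{\min(p_0, p_1)} \sqrt{\max(p_0, p_1)} \right)^2 \\
&\le \int \min(p_0, p_1) \int \max(p_0, p_1) \quad \text{by Cauchy-Schwarz inequality} \\
&= \int \min(p_0, p_1) \left( 2 - \int \min(p_0, p_1) \right) \quad \text{since } \int \min(p_0, p_1) + \int \max(p_0, p_1) = \int p_0 + \int p_1 = 2 \\
&\le 2 \int \min(p_0, p_1) \\
&= 2 \left( 1 - V(P_0, P_1) \right)
\end{aligned} \tag{21.31}$$

For the second inequality,

$$\begin{aligned}
A^2(P_0, P_1) &= \left( \int \sqrt{p_0 p_1} \right)^2 \\
&= \exp\left( \log \left( \int \sqrt{p_0 p_1} \right)^2 \right) \\
&= \exp\left( 2 \log \int \sqrt{p_0 p_1} \right) \\
&= \exp\left( 2 \log \int \sqrt{\frac{p_0}{p_1}} \, p_1 \right) \\
&\ge \exp\left( 2 \int \log \left( \sqrt{\frac{p_0}{p_1}} \right) p_1 \right) \quad \text{by Jensen's inequality} \\
&= \exp\left( -\int \log \left( \frac{p_1}{p_0} \right) p_1 \right) \\
&= \exp\left( -K(P_1 \| P_0) \right)
\end{aligned} \tag{21.32}$$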
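The chain of inequalities in Lemma 21.3 can be sanity-checked numerically. The sketch below assumes randomly drawn discrete distributions on five points (a hypothetical setting in which all integrals become sums) and verifies $1 - V \ge \frac{1}{2} A^2 \ge \frac{1}{2} e^{-K}$ on each draw.

```python
import numpy as np

def lemma_terms(p0, p1):
    # The three quantities compared in the lemma, for discrete p0, p1 > 0.
    one_minus_tv = 1.0 - 0.5 * np.abs(p0 - p1).sum()          # 1 - V(P0, P1)
    half_affinity_sq = 0.5 * np.sqrt(p0 * p1).sum() ** 2       # (1/2) A^2(P0, P1)
    kl = np.sum(p1 * np.log(p1 / p0))                          # K(P1 || P0)
    return one_minus_tv, half_affinity_sq, 0.5 * np.exp(-kl)

rng = np.random.default_rng(2)
chain_holds = True
for _ in range(1000):
    p0 = rng.dirichlet(np.ones(5))
    p1 = rng.dirichlet(np.ones(5))
    a, b, c = lemma_terms(p0, p1)
    # Small tolerance for floating-point rounding.
    chain_holds = chain_holds and (a >= b - 1e-12) and (b >= c - 1e-12)
print("chain holds on all draws:", chain_holds)
```

Combined with $P_{e,1} \ge \frac{1}{2}(1 - V(P_0, P_1))$, the same numbers give the bound $P_{e,1} \ge \frac{1}{4} e^{-K(P_1 \| P_0)}$.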

Putting everything together, we now have the following Theorem:

Theorem 21.1:

Let $\mathcal{F}$ be a class of models, and suppose we have observations $Z$ distributed according to $P_f$, $f \in \mathcal{F}$. Let $d(\hat{f}_n, f)$ be the performance measure of the estimator $\hat{f}_n(Z)$ relative to the true model $f$. Assume also that $d(\cdot, \cdot)$ is a semi-distance. Let $f_0, f_1 \in \mathcal{F}$ be s.t. $d(f_0, f_1) \ge 2 s_n$. Then

$$\begin{aligned}
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( d(\hat{f}_n, f) \ge s_n \right)
&\ge \inf_{\hat{f}_n} \max_{j \in \{0,1\}} P_{f_j}\left( d(\hat{f}_n, f_j) \ge s_n \right) \\
&\ge \frac{1}{4} \exp\left( -K(P_{f_1} \| P_{f_0}) \right)
\end{aligned} \tag{21.33}$$

How do we use this theorem?

Choose $f_0, f_1$ such that $K(P_1 \| P_0) \le \alpha$; then $P_{e,1}$ is bounded away from 0 and we get a bound

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( d(\hat{f}_n, f) \ge s_n \right) \ge c > 0 \tag{21.34}$$

or, after Markov's inequality,

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} E_f\left[ d(\hat{f}_n, f) \right] \ge c s_n \tag{21.35}$$

To apply the theorem, we need to design $f_0, f_1$ s.t. $d(f_0, f_1) \ge 2 s_n$ and $\exp(-K(P_{f_1} \| P_{f_0}))$ is bounded away from 0 (uniformly in $n$). To reiterate, the design of $f_0, f_1$ requires careful construction so as to balance the tradeoff between the first condition, which requires $f_0, f_1$ to be far apart, and the second condition, which requires $f_0, f_1$ to be close to each other.

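Example

Let's use this theorem in a problem we are familiar with. Let $X \in [0, 1]$ and $Y | X = x \sim \text{Bernoulli}(\eta(x))$, where $\eta(x) = P(Y = 1 | X = x)$.

Suppose $G^* = [t^*, 1]$. We proved that under these assumptions and an upper bound on the density of $X$, the Chernoff bounding technique yielded an expected error rate for ERM

$$E\left[ R(\hat{G}_n) \right] - R^* = O\left( \sqrt{\frac{\log n}{n}} \right) \tag{21.36}$$

Is this the best possible rate?

Construct two models in the above class (denote it by $\mathcal{P}$), $P_{XY}^{(0)}$ and $P_{XY}^{(1)}$. For both take $P_X \sim \text{Uniform}([0, 1])$, and let $\eta^{(0)}(x) = 1/2 - a$, $\eta^{(1)}(x) = 1/2 + a$ $(a > 0)$, so $G_0^* = \emptyset$, $G_1^* = [0, 1]$.

We are interested in controlling the excess risk

$$R(\hat{G}_n) - R(G^*) = \int_{\hat{G}_n \Delta G^*} |2\eta(x) - 1| \, dP_X(x) \tag{21.37}$$

Note that if the true underlying model is either $P_{XY}^{(0)}$ or $P_{XY}^{(1)}$, we have:

$$R_j(\hat{G}_n) - R_j(G_j^*) = \int_{\hat{G}_n \Delta G_j^*} |2\eta^{(j)}(x) - 1| \, dx = 2a \int_{\hat{G}_n \Delta G_j^*} dx = 2a \, d_\Delta(\hat{G}_n, G_j^*) \tag{21.38}$$

where $d_\Delta(G, G') = \int_{G \Delta G'} dx$ is the Lebesgue measure of the symmetric difference of $G$ and $G'$.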



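Proposition 1

$d_\Delta(\cdot, \cdot)$ is a semi-distance.

It suffices to show that $d(G_1, G_2) = d(G_2, G_1) \ge 0$, $d(G, G) = 0 \; \forall G$ and $d(G_1, G_2) \le d(G_1, G_3) + d(G_3, G_2)$. The first two statements are obvious. The last one (triangle inequality) follows from the fact that $G_1 \Delta G_2 \subseteq (G_1 \Delta G_3) \cup (G_3 \Delta G_2)$.

Suppose this was not the case; then $\exists x : x \in G_1 \Delta G_2$ s.t. $x \notin G_1 \Delta G_3$ and $x \notin G_2 \Delta G_3$. In other words,

$$x \in (G_1 \Delta G_2) \cap (G_1 \Delta G_3)^c \cap (G_2 \Delta G_3)^c \tag{21.39}$$

Since $S \Delta T = (S \cap T^c) \cup (S^c \cap T)$, we have:

$$\begin{aligned}
x &\in \left[ (G_1 \cap G_2^c) \cup (G_1^c \cap G_2) \right] \cap \left[ (G_1^c \cup G_3) \cap (G_1 \cup G_3^c) \right] \cap \left[ (G_2^c \cup G_3) \cap (G_2 \cup G_3^c) \right] \\
&= \left[ G_1 \cap (G_1^c \cup G_3) \cap G_2^c \cap (G_2 \cup G_3^c) \right] \cup \left[ G_1^c \cap (G_1 \cup G_3^c) \cap G_2 \cap (G_2^c \cup G_3) \right] \\
&= \left[ G_1 \cap G_3 \cap G_2^c \cap G_3^c \right] \cup \left[ G_1^c \cap G_3^c \cap G_2 \cap G_3 \right] \\
&= \emptyset, \text{ a contradiction.}
\end{aligned}$$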

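Let's look at the first reduction step:

$$\inf_{\hat{G}_n} \sup_{P \in \mathcal{P}} P\left( R(\hat{G}_n) - R(G^*) \ge s_n \right) \ge \inf_{\hat{G}_n} \max_{j \in \{0,1\}} P_j\left( R_j(\hat{G}_n) - R_j(G_j^*) \ge s_n \right) \tag{21.40}$$

$$= \inf_{\hat{G}_n} \max_{j \in \{0,1\}} P_j\left( d_\Delta(\hat{G}_n, G_j^*) \ge \frac{s_n}{2a} \right) \tag{21.41}$$

So we can work out a bound on $d_\Delta$ and then translate it to excess risk.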
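Let's apply Theorem 21.1. Note that $d_\Delta(G_0^*, G_1^*) = 1$, and let $P_0 = P^{(0)}_{X_1, Y_1, \dots, X_n, Y_n}$ and $P_1 = P^{(1)}_{X_1, Y_1, \dots, X_n, Y_n}$. Then

$$\begin{aligned}
K(P_1 \| P_0) &= E_1\left[ \log \frac{p^{(1)}_{X_1, Y_1, \dots, X_n, Y_n}(X_1, Y_1, \dots, X_n, Y_n)}{p^{(0)}_{X_1, Y_1, \dots, X_n, Y_n}(X_1, Y_1, \dots, X_n, Y_n)} \right] \\
&= E_1\left[ \log \frac{p^{(1)}_{X_1, Y_1}(X_1, Y_1) \cdots p^{(1)}_{X_n, Y_n}(X_n, Y_n)}{p^{(0)}_{X_1, Y_1}(X_1, Y_1) \cdots p^{(0)}_{X_n, Y_n}(X_n, Y_n)} \right] \\
&= \sum_{i=1}^n E_1\left[ \log \frac{p^{(1)}_{Y_i | X_i}(Y_i | X_i)}{p^{(0)}_{Y_i | X_i}(Y_i | X_i)} \right] \\
&= n E_1\left[ \log \frac{p^{(1)}_{Y_1 | X_1}(Y_1 | X_1)}{p^{(0)}_{Y_1 | X_1}(Y_1 | X_1)} \right]
\end{aligned} \tag{21.42}$$

Now $p^{(0)}_{Y_1 | X_1}(Y_1 = 1 | X_1) = 1/2 - a$ and $p^{(1)}_{Y_1 | X_1}(Y_1 = 1 | X_1) = 1/2 + a$. Also, under model 1, $Y_1 \sim \text{Bernoulli}(1/2 + a)$. So we get:

$$\begin{aligned}
K(P_1 \| P_0) &= n \left[ (1/2 + a) \log \frac{1/2 + a}{1/2 - a} + (1/2 - a) \log \frac{1/2 - a}{1/2 + a} \right] \\
&= n \left[ 2a \log(1/2 + a) - 2a \log(1/2 - a) \right] \\
&= 2na \log \frac{1/2 + a}{1/2 - a} \\
&\le 2na \left( \frac{1/2 + a}{1/2 - a} - 1 \right) \\
&= \frac{4na^2}{1/2 - a}
\end{aligned} \tag{21.43}$$

Let $a = 1/\sqrt{n}$ and $n > 16$; then $K(P_1 \| P_0) \le \dfrac{4n \cdot \frac{1}{n}}{1/2 - 1/\sqrt{n}} = \dfrac{4}{1/2 - 1/\sqrt{n}} \le 16$.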
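The KL computation above is easy to check numerically. The following sketch evaluates the exact divergence $n \left[ (1/2 + a) \log \frac{1/2 + a}{1/2 - a} + (1/2 - a) \log \frac{1/2 - a}{1/2 + a} \right]$ for $a = 1/\sqrt{n}$ and compares it with the bound $4na^2 / (1/2 - a)$ and with 16; the sample sizes are arbitrary illustrative choices.

```python
import math

def kl_product_bernoulli(n, a):
    # K(P_1 || P_0) for n i.i.d. pairs: eta = 1/2 + a under model 1, 1/2 - a under model 0.
    p, q = 0.5 + a, 0.5 - a
    per_sample = p * math.log(p / q) + q * math.log(q / p)
    return n * per_sample

results = []
for n in (17, 100, 10_000, 1_000_000):      # all satisfy n > 16
    a = 1.0 / math.sqrt(n)
    kl = kl_product_bernoulli(n, a)
    bound = 4 * n * a**2 / (0.5 - a)         # right-hand side of the log(x) <= x - 1 step
    results.append((n, kl, bound))
    print(f"n={n:>8}  KL={kl:.4f}  bound={bound:.4f}")

# Resulting lower bound on the testing error from Theorem 21.1.
print("(1/4) exp(-16) =", 0.25 * math.exp(-16))
```

For every listed $n$ the exact divergence should sit below the bound, and the bound below 16, which is what makes the constant $\frac{1}{4} e^{-16}$ independent of $n$.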

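Using Theorem 21.1, since $d_\Delta(G_0^*, G_1^*) = 1$, we get:

$$\inf_{\hat{G}_n} \max_{j \in \{0,1\}} P_j\left( d_\Delta(\hat{G}_n, G_j^*) \ge 1/2 \right) \ge \frac{1}{4} e^{-16} \tag{21.44}$$

Taking $s_n = 1/\sqrt{n}$ (so that $s_n / 2a = 1/2$), this implies

$$\inf_{\hat{G}_n} \sup_{P \in \mathcal{P}} P\left( R(\hat{G}_n) - R(G^*) \ge 1/\sqrt{n} \right) \ge \frac{1}{4} e^{-16} \tag{21.45}$$

or, after Markov's inequality,

$$\inf_{\hat{G}_n} \sup_{P \in \mathcal{P}} E\left[ R(\hat{G}_n) \right] - R(G^*) \ge \frac{1}{4} e^{-16} \frac{1}{\sqrt{n}} \tag{21.46}$$

Therefore, apart from the $\sqrt{\log n}$ factor, ERM is getting the best possible performance.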

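Reducing the initial problem to a binary hypothesis test does not always work. Sometimes we need $M$ hypotheses, with $M \to \infty$ as $n \to \infty$. If this is the case, we have the following theorem:

Theorem 2 Let $M \ge 2$ and let $\{f_0, \dots, f_M\} \subseteq \mathcal{F}$ be such that

.: $d(f_j, f_k) \ge 2 s_n$, $\forall j \ne k$, where $d$ is a semi-distance.

.: $\frac{1}{M} \sum_{j=1}^M K(P_j \| P_0) \le \alpha \log M$, with $0 < \alpha < 1/8$.

Then

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( d(\hat{f}_n, f) \ge s_n \right) \ge \inf_{\hat{f}_n} \max_{j} P_j\left( d(\hat{f}_n, f_j) \ge s_n \right) \ge \frac{\sqrt{M}}{1 + \sqrt{M}} \left( 1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}} \right) > 0 \tag{21.47}$$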
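To get a feel for the size of the constant in the multiple-hypothesis bound, one can simply evaluate it. The sketch below assumes the right-hand side has the form $\frac{\sqrt{M}}{1 + \sqrt{M}} \left( 1 - 2\alpha - \sqrt{2\alpha / \log M} \right)$; the function name and the grid of $(M, \alpha)$ values are hypothetical choices for illustration.

```python
import math

def multi_hypothesis_bound(M, alpha):
    # Assumed form of the lower bound: sqrt(M)/(1+sqrt(M)) * (1 - 2*alpha - sqrt(2*alpha/log M)).
    root_m = math.sqrt(M)
    return root_m / (1.0 + root_m) * (1.0 - 2.0 * alpha - math.sqrt(2.0 * alpha / math.log(M)))

alphas = [0.01, 0.05, 0.10, 0.1249]          # all strictly below 1/8
for M in (2, 8, 64, 1024):
    row = "  ".join(f"{multi_hypothesis_bound(M, a):.3f}" for a in alphas)
    print(f"M={M:>5}: {row}")
```

Under this form the bound is smallest for $M = 2$ and $\alpha$ close to $1/8$, and it increases toward $1 - 2\alpha$ as $M$ grows.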
We will use this theorem to show that the estimator of Lecture 4 (Chapter 5) is optimal. Recall the setup of Lecture 4 (Chapter 5). Let

$$\mathcal{F} = \{ f : |f(t) - f(s)| \le L |t - s| \;\; \forall t, s \} \tag{21.48}$$

i.e. the class of Lipschitz functions with constant $L$. Let

$$x_i = i/n, \quad i = 1, \dots, n \tag{21.49}$$

$$Y_i = f(x_i) + W_i \tag{21.50}$$

where $E[W_i] = 0$, $E[W_i^2] = \sigma^2 < \infty$, and $W_i, W_j$ are independent if $i \ne j$. In that lecture, we constructed an estimator $\hat{f}_n$ such that

$$\sup_{f \in \mathcal{F}} E\left[ \|\hat{f}_n - f\|^2 \right] = O\left( n^{-2/3} \right) \tag{21.51}$$

Is this the best we can do?

We are going to construct a collection $f_0, \dots, f_M \in \mathcal{F}$ and apply Theorem 2. Notice that the metric of interest is $d(\hat{f}_n, f) = \|\hat{f}_n - f\|$, a semi-distance. Let $W_i \sim \mathcal{N}(0, \sigma^2)$. Let $m \in \mathbb{N}$, $h = 1/m$ and define

$$K(x) = \left( \frac{Lh}{2} - L|x| \right) 1_{\{|x| \le h/2\}} = \frac{L}{2} \left( h - 2|x| \right) 1_{\{|x| \le h/2\}} \tag{21.52}$$

Figure 21.1: plot of $K(x)$ against $x$, with axis ticks at $-h/2$ and $h/2$.



Note that $|K(a) - K(b)| \le L|a - b|$, $\forall a, b$. The subclass we are going to consider consists of functions of the form

Figure 21.2: plot of $f_w(x)$.

i.e. "bump" functions. Let $\Omega = \{0, 1\}^m$ be the collection of binary vectors of length $m$, e.g. $w = (1, 0, 1, \dots, 0) \in \Omega$. Define

$$f_w(x) = \sum_{i=1}^m w_i K\left( x - \frac{h}{2}(2i - 1) \right) \tag{21.53}$$
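Note that for $w, w' \in \Omega$,

$$\begin{aligned}
d(f_w, f_{w'}) = \|f_w - f_{w'}\| &= \left( \int_0^1 \sum_{i=1}^m (w_i - w_i')^2 K^2\left( x - \frac{h}{2}(2i - 1) \right) dx \right)^{1/2} \\
&= \sqrt{\rho(w, w')} \sqrt{\int K^2(x) \, dx}
\end{aligned} \tag{21.54}$$

where $\rho(w, w')$ is the Hamming distance, $\rho(w, w') = \sum_{i=1}^m |w_i - w_i'|$. Now

$$\int K^2(x) \, dx = 2 \int_0^{h/2} L^2 x^2 \, dx = 2 L^2 \frac{h^3}{3 \cdot 8} = \frac{L^2}{12} h^3, \tag{21.55}$$

so

$$d(f_w, f_{w'}) = \sqrt{\rho(w, w')} \, \frac{L}{\sqrt{12}} \, h^{3/2} \tag{21.56}$$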
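The closed form $d(f_w, f_{w'}) = \sqrt{\rho(w, w')} \, \frac{L}{\sqrt{12}} h^{3/2}$ can be checked by numerical integration. In the sketch below the values $L = 1$, $m = 8$ and the two binary vectors are hypothetical choices, and the integral is approximated with a trapezoid rule on a fine grid.

```python
import numpy as np

L, m = 1.0, 8            # hypothetical Lipschitz constant and number of bumps
h = 1.0 / m

def K(x):
    # Bump of height L*h/2 supported on [-h/2, h/2].
    return (L * h / 2 - L * np.abs(x)) * (np.abs(x) <= h / 2)

def f_w(w, x):
    # Sum of bumps centred at (h/2)(2i - 1), switched on where w_i = 1.
    centers = (h / 2) * (2 * np.arange(1, m + 1) - 1)
    return sum(w_i * K(x - c) for w_i, c in zip(w, centers))

x = np.linspace(0.0, 1.0, 200_001)
dx = x[1] - x[0]
w1 = np.array([1, 0, 1, 1, 0, 0, 1, 0])
w2 = np.array([0, 0, 1, 0, 1, 0, 1, 1])
rho = int(np.sum(w1 != w2))                       # Hamming distance

sq = (f_w(w1, x) - f_w(w2, x)) ** 2
integral = dx * (sq.sum() - 0.5 * (sq[0] + sq[-1]))   # trapezoid rule on [0, 1]
numeric = np.sqrt(integral)
closed_form = np.sqrt(rho) * L * h**1.5 / np.sqrt(12)
print(rho, numeric, closed_form)
```

The two distances should agree to several digits; the bumps have disjoint supports, which is why the cross terms vanish and only the Hamming distance matters.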

Since $|\Omega| = 2^m$, the number of functions in our class is $2^m$. It turns out that we do not need to consider all functions $f_w$, $w \in \Omega$, but only a select few. Using all the functions leads to a looser lower bound of the form $n^{-1}$, which corresponds to the parametric rate. The problem under consideration is non-parametric, and hence we expect a slower rate of convergence. To get a tighter lower bound, the following result is of use:
Lemma 21.4: Varshamov-Gilbert '62

Let $m \ge 8$. There exists a subset $\{w^{(0)}, \dots, w^{(M)}\}$ of $\Omega$ such that $w^{(0)} = (0, 0, \dots, 0)$,

$$\rho(w^{(j)}, w^{(k)}) \ge \frac{m}{8}, \quad \forall \, 0 \le j < k \le M, \tag{21.57}$$

and $M \ge 2^{m/8}$.

What this lemma says is that there are many (exponentially many in $m$, at least $2^{m/8}$) sequences in $\Omega$ that are very different (i.e. $\rho(w^{(j)}, w^{(k)}) \sim m$). We are going to use the lemma to construct a useful set of hypotheses. Let $\{w^{(0)}, \dots, w^{(M)}\}$ be the class of sequences in the lemma and define

$$f_j = f_{w^{(j)}}, \quad j \in \{0, \dots, M\} \tag{21.58}$$

We now need to look at the conditions of Theorem 2 and choose $m$ appropriately.
First note that for $j \ne k$,

$$d(f_j, f_k) = \sqrt{\rho(w^{(j)}, w^{(k)})} \, \frac{L}{\sqrt{12}} \, h^{3/2} \ge \sqrt{\frac{m}{8}} \, \frac{L}{\sqrt{12}} \, m^{-3/2} = \frac{L}{4\sqrt{6}} \, m^{-1} \tag{21.59}$$

Now let $P_j = P^{(j)}_{Y_1, \dots, Y_n}$, $j \in \{0, \dots, M\}$. Then
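The Varshamov-Gilbert lemma is an existence result, but for small $m$ a separated subset can be built explicitly. The sketch below uses a simple greedy scan (an illustration only, not the argument behind the lemma) with the hypothetical choice $m = 10$; binary vectors are encoded as integers so that the Hamming distance is a bit count.

```python
def hamming(u, v):
    # Hamming distance between two binary vectors encoded as integers.
    return bin(u ^ v).count("1")

def greedy_separated_subset(m, min_dist):
    # Scan {0,1}^m in lexicographic order, starting from the all-zeros vector,
    # and keep every vector at distance >= min_dist from all vectors kept so far.
    chosen = []
    for w in range(2 ** m):
        if all(hamming(w, c) >= min_dist for c in chosen):
            chosen.append(w)
    return chosen

m = 10                                         # hypothetical small value, m >= 8
codebook = greedy_separated_subset(m, min_dist=m / 8)
M = len(codebook) - 1
print("M =", M, " lemma guarantees at least", 2 ** (m / 8))
```

Here `codebook[0]` plays the role of $w^{(0)} = (0, \dots, 0)$, and the greedy set is far larger than the $2^{m/8}$ guaranteed by the lemma, since the separation $m/8$ is a very mild requirement for small $m$.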



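$$\begin{aligned}
K(P_j \| P_0) &= E_j\left[ \log \frac{p^{(j)}_{Y_1, \dots, Y_n}(Y_1, \dots, Y_n)}{p^{(0)}_{Y_1, \dots, Y_n}(Y_1, \dots, Y_n)} \right] \\
&= \sum_{i=1}^n E_j\left[ \log \frac{p^{(j)}_{Y_i}(Y_i)}{p^{(0)}_{Y_i}(Y_i)} \right] \\
&= \sum_{i=1}^n \frac{f_j^2(x_i)}{2\sigma^2} \quad \text{(Gaussian noise, } f_0 = 0\text{)} \\
&\le \frac{1}{2\sigma^2} \sum_{i=1}^n \left( \frac{Lh}{2} \right)^2 = \frac{L^2}{8\sigma^2} n h^2 = \frac{L^2}{8\sigma^2} n m^{-2}
\end{aligned} \tag{21.60}$$

Now notice that $\log M \ge \frac{m}{8} \log 2$ (from Lemma 21.4). We want to choose $m$ such that

$$\frac{1}{M} \sum_{j=1}^M K(P_j \| P_0) \le \frac{L^2}{8\sigma^2} n m^{-2} < \alpha \frac{m}{8} \log 2 \le \alpha \log M \tag{21.61}$$

This gives

$$m > \left( \frac{L^2}{\alpha \sigma^2 \log 2} \right)^{1/3} n^{1/3} = C_0 n^{1/3} \tag{21.62}$$

so take $m = \lfloor C_0 n^{1/3} + 1 \rfloor$. Now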



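$$d(f_j, f_k) \ge \frac{L}{4\sqrt{6}} \, m^{-1} \ge 2 \, \text{const} \; n^{-1/3} \quad \text{for } n \ge n_0(\text{const}) \tag{21.63}$$

Therefore,

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( \|\hat{f}_n - f\| \ge \text{const} \; n^{-1/3} \right) \ge c > 0 \tag{21.64}$$

or

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} P_f\left( \|\hat{f}_n - f\|^2 \ge \text{const} \; n^{-2/3} \right) \ge c > 0 \tag{21.65}$$

or, after Markov's inequality,

$$\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} E_f\left[ \|\hat{f}_n - f\|^2 \right] \ge c \cdot \text{const} \; n^{-2/3} \tag{21.66}$$

Therefore, the estimator constructed in class attains the optimal rate of convergence.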
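The choice of $m$ in the argument above can be traced numerically. The sketch below assumes $L = \sigma = 1$ and $\alpha = 1/16$ (hypothetical values), computes $m = \lfloor C_0 n^{1/3} + 1 \rfloor$, and checks the KL condition $\frac{L^2}{8\sigma^2} n m^{-2} < \alpha \frac{m}{8} \log 2$ together with the $n^{-1/3}$ scaling of the separation $\frac{L}{4\sqrt{6}} m^{-1}$.

```python
import math

def design(n, L=1.0, sigma=1.0, alpha=1.0 / 16):
    # Hypothetical parameter values; alpha must lie in (0, 1/8).
    C0 = (L**2 / (alpha * sigma**2 * math.log(2))) ** (1.0 / 3.0)
    m = math.floor(C0 * n ** (1.0 / 3.0) + 1)
    return {
        "C0": C0,
        "m": m,
        "kl_bound": L**2 * n / (8 * sigma**2 * m**2),      # bound on K(P_j || P_0)
        "alpha_log_M": alpha * (m / 8) * math.log(2),      # lower bound on alpha * log M
        "separation": L / (4 * math.sqrt(6) * m),          # lower bound on d(f_j, f_k)
    }

designs = {n: design(n) for n in (100, 1_000, 1_000_000)}
for n, d in designs.items():
    print(n, d["m"], d["kl_bound"], d["alpha_log_M"], d["separation"] * n ** (1.0 / 3.0))
```

The last printed column, separation times $n^{1/3}$, should stay bounded away from zero as $n$ grows, which is the $n^{-1/3}$ rate appearing in the lower bound.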
Glossary

Bayes Risk

The Bayes risk is the infimum of the risk over all classifiers:

$$R^* = \inf_f R(f). \tag{3.4}$$

We can prove that the Bayes risk is achieved by the Bayes classifier.

B Bayes Classifier

The Bayes classifier is the following mapping:

$$f^*(x) = \begin{cases} 1, & \eta(x) \ge 1/2 \\ 0, & \text{otherwise} \end{cases} \tag{3.5}$$

where

$$\eta(x) = P_{Y|X}(Y = 1 | X = x). \tag{3.6}$$

Note that for any $x$, $f^*(x)$ is the value of $y \in \{0, 1\}$ that maximizes $P_{Y|X}(Y = y | X = x)$.



E Empirical Risk

Let $\{X_i, Y_i\}_{i=1}^n \overset{iid}{\sim} P_{XY}$ be a collection of training data. Then the empirical risk is defined as

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i). \tag{3.22}$$

Empirical risk minimization is the process of choosing a learning rule which minimizes the empirical risk; i.e.,

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f). \tag{3.23}$$

P Prefix Code 

A code is called a prefix code if no codeword is a prefix of any other codeword. 

Example (from Cover & Thomas '91): Consider an alphabet of symbols, say A, B, C, and D, and the codebooks below.

Figure 10.1: table of the four codebooks (an unsupported media type in this version; to view, please see http://cnx.org/content/m16271/latest/).

In the singular codebook we assign the same codeword to each symbol, a system that is obviously flawed! In the second case, the codes are not singular but the codeword 010 could represent B or CA or AD. Hence it is not a uniquely decodable codebook.
The third and fourth cases are both examples of uniquely decodable codebooks, but the fourth has the added feature that no codeword is a prefix of another. Prefix codes can be decoded from left to right since each codeword is "self-punctuating", in this case with a zero to indicate the end of each word.

To design a uniquely decodable codebook in general is as challenging as the problem of selecting $c(f)$ to satisfy

$$\sum_{f \in \mathcal{F}} e^{-c(f)} < \infty. \tag{10.17}$$

However, prefix codes can often be easily designed or specified and they are inherently decodable. Moreover, prefix codes satisfy an important inequality called the Kraft Inequality.
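The prefix property and the Kraft inequality (for a binary prefix code, $\sum_i 2^{-\ell_i} \le 1$ where $\ell_i$ are the codeword lengths) are easy to check in code. The codebooks below are assumptions made for illustration: the second reproduces the ambiguity described in the text (010 decodes as B, CA or AD), and the last is a hypothetical prefix code in which every codeword ends with a zero; they are not copied from the missing figure.

```python
def is_prefix_code(codebook):
    # True if no codeword is a prefix of a different symbol's codeword.
    words = list(codebook.values())
    return not any(i != j and v.startswith(u)
                   for i, u in enumerate(words)
                   for j, v in enumerate(words))

def kraft_sum(codebook, base=2):
    # Sum of base^(-length) over all codewords.
    return sum(base ** -len(word) for word in codebook.values())

# Illustrative (assumed) codebooks for the alphabet A, B, C, D.
singular = {"A": "0", "B": "0", "C": "0", "D": "0"}
ambiguous = {"A": "0", "B": "010", "C": "01", "D": "10"}   # 010 = B = CA = AD
prefix = {"A": "0", "B": "10", "C": "110", "D": "1110"}

for name, book in [("singular", singular), ("ambiguous", ambiguous), ("prefix", prefix)]:
    print(name, is_prefix_code(book), kraft_sum(book))
```

Only the last codebook passes the prefix check, and its Kraft sum does not exceed one.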

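T The VC dimension

$V_{\mathcal{A}}$ of a collection of sets $\mathcal{A}$ is defined as the largest integer $n$ such that $S_{\mathcal{A}}(n) = 2^n$.

Example: $\mathcal{A} = \{(-\infty, t] ; t \in \mathbb{R}\}$, $S_{\mathcal{A}}(n) = n + 1$, hence $V_{\mathcal{A}} = 1$.

Example: $\mathcal{A} = \{\text{all rectangles in } \mathbb{R}^2\}$.

$S_{\mathcal{A}}(n) = 2^n$ for $n = 1, 2, 3, 4$ and $S_{\mathcal{A}}(n) < 2^n$ for $n > 4$. Hence $V_{\mathcal{A}} = 4$.

The VC dimension provides a useful bound on the growth of the shatter coefficients.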
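The first example can be reproduced by direct enumeration. The sketch below counts the distinct labelings that half-lines $(-\infty, t]$ induce on $n$ distinct points (the point sets are arbitrary illustrative choices) and compares the count with $2^n$.

```python
def halfline_labelings(points):
    # Number of distinct subsets of `points` picked out by sets of the form (-inf, t].
    pts = sorted(set(points))
    thresholds = [pts[0] - 1] + pts            # one t below all points, then t at each point
    patterns = {tuple(x <= t for x in points) for t in thresholds}
    return len(patterns)

counts = {n: halfline_labelings(list(range(n))) for n in range(1, 7)}
for n, s in counts.items():
    print(n, s, 2 ** n)

# Largest n (among those tried) for which all 2^n labelings are achieved.
vc_dimension = max(n for n, s in counts.items() if s == 2 ** n)
print("VC dimension of half-lines:", vc_dimension)
```

For distinct points the count does not depend on where the points sit, so this enumeration gives the shatter coefficient $n + 1$ for each $n$ tried.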